In marketing analytics, customer segmentation plays a vital role in optimizing ROI.
The dataset contains the following features:
Note: You can assume that the data is collected in the year 2016.
#Importing necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# To scale the data using z-score
from sklearn.preprocessing import StandardScaler
# Importing PCA and t-SNE
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
%matplotlib inline
# Importing clustering algorithms
!pip install scikit-learn-extra
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn_extra.cluster import KMedoids
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import DBSCAN
# Silhouette score
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings("ignore")
from scipy.stats import chi2_contingency
#Mounting Google colab
from google.colab import drive
drive.mount("/content/drive")
#Loading the data
df = pd.read_csv('/content/drive/MyDrive/MIT/Capstone project/marketing_campaign+%284%29.csv')
#Reading the dataset by looking at the first 5 rows
df.head()
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5524 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 04-09-2012 | 58 | 635 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 2174 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 08-03-2014 | 38 | 11 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 4141 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 21-08-2013 | 26 | 426 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 6182 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 10-02-2014 | 26 | 11 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5324 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 19-01-2014 | 94 | 173 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 27 columns
#Understanding the shape of the dataset
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 2240 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2240 non-null int64 1 Year_Birth 2240 non-null int64 2 Education 2240 non-null object 3 Marital_Status 2240 non-null object 4 Income 2216 non-null float64 5 Kidhome 2240 non-null int64 6 Teenhome 2240 non-null int64 7 Dt_Customer 2240 non-null object 8 Recency 2240 non-null int64 9 MntWines 2240 non-null int64 10 MntFruits 2240 non-null int64 11 MntMeatProducts 2240 non-null int64 12 MntFishProducts 2240 non-null int64 13 MntSweetProducts 2240 non-null int64 14 MntGoldProds 2240 non-null int64 15 NumDealsPurchases 2240 non-null int64 16 NumWebPurchases 2240 non-null int64 17 NumCatalogPurchases 2240 non-null int64 18 NumStorePurchases 2240 non-null int64 19 NumWebVisitsMonth 2240 non-null int64 20 AcceptedCmp3 2240 non-null int64 21 AcceptedCmp4 2240 non-null int64 22 AcceptedCmp5 2240 non-null int64 23 AcceptedCmp1 2240 non-null int64 24 AcceptedCmp2 2240 non-null int64 25 Complain 2240 non-null int64 26 Response 2240 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 472.6+ KB
df.shape
(2240, 27)
There are 2240 rows and 27 columns. All columns are integer or float except Dt_Customer, Marital_Status, and Education, which are objects.
Income has 24 missing values; there are no other missing values.
Dt_Customer should be converted to a datetime type.
#Confirming missing values for Income
df['Income'].isnull().sum()
24
#Looking at how these missing data are reported in the dataset
df[df['Income'].isnull()]
| ID | Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10 | 1994 | 1983 | Graduation | Married | NaN | 1 | 0 | 15-11-2013 | 11 | 5 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27 | 5255 | 1986 | Graduation | Single | NaN | 1 | 0 | 20-02-2013 | 19 | 5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 43 | 7281 | 1959 | PhD | Single | NaN | 0 | 0 | 05-11-2013 | 80 | 81 | ... | 3 | 4 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 48 | 7244 | 1951 | Graduation | Single | NaN | 2 | 1 | 01-01-2014 | 96 | 48 | ... | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 58 | 8557 | 1982 | Graduation | Single | NaN | 1 | 0 | 17-06-2013 | 57 | 11 | ... | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 71 | 10629 | 1973 | 2n Cycle | Married | NaN | 1 | 0 | 14-09-2012 | 25 | 25 | ... | 0 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 90 | 8996 | 1957 | PhD | Married | NaN | 2 | 1 | 19-11-2012 | 4 | 230 | ... | 2 | 8 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 91 | 9235 | 1957 | Graduation | Single | NaN | 1 | 1 | 27-05-2014 | 45 | 7 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 92 | 5798 | 1973 | Master | Together | NaN | 0 | 0 | 23-11-2013 | 87 | 445 | ... | 4 | 8 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 128 | 8268 | 1961 | PhD | Married | NaN | 0 | 1 | 11-07-2013 | 23 | 352 | ... | 1 | 7 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 133 | 1295 | 1963 | Graduation | Married | NaN | 0 | 1 | 11-08-2013 | 96 | 231 | ... | 5 | 7 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 312 | 2437 | 1989 | Graduation | Married | NaN | 0 | 0 | 03-06-2013 | 69 | 861 | ... | 5 | 12 | 3 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 319 | 2863 | 1970 | Graduation | Single | NaN | 1 | 2 | 23-08-2013 | 67 | 738 | ... | 3 | 10 | 7 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1379 | 10475 | 1970 | Master | Together | NaN | 0 | 1 | 01-04-2013 | 39 | 187 | ... | 2 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1382 | 2902 | 1958 | Graduation | Together | NaN | 1 | 1 | 03-09-2012 | 87 | 19 | ... | 0 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1383 | 4345 | 1964 | 2n Cycle | Single | NaN | 1 | 1 | 12-01-2014 | 49 | 5 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1386 | 3769 | 1972 | PhD | Together | NaN | 1 | 0 | 02-03-2014 | 17 | 25 | ... | 0 | 3 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2059 | 7187 | 1969 | Master | Together | NaN | 1 | 1 | 18-05-2013 | 52 | 375 | ... | 10 | 4 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2061 | 1612 | 1981 | PhD | Single | NaN | 1 | 0 | 31-05-2013 | 82 | 23 | ... | 0 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2078 | 5079 | 1971 | Graduation | Married | NaN | 1 | 1 | 03-03-2013 | 82 | 71 | ... | 1 | 3 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2079 | 10339 | 1954 | Master | Together | NaN | 0 | 1 | 23-06-2013 | 83 | 161 | ... | 1 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2081 | 3117 | 1955 | Graduation | Single | NaN | 0 | 1 | 18-10-2013 | 95 | 264 | ... | 1 | 5 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2084 | 5250 | 1943 | Master | Widow | NaN | 0 | 0 | 30-10-2013 | 75 | 532 | ... | 5 | 11 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 2228 | 8720 | 1978 | 2n Cycle | Together | NaN | 0 | 0 | 12-08-2012 | 53 | 32 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
24 rows × 27 columns
These 24 rows represent about 1% of the data, so removing them should not introduce any meaningful bias.
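As a sanity check, the share of affected rows can be computed before deciding to drop them. A minimal sketch on a toy frame (the values are made up; `df` itself is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df (hypothetical values); the check itself
# is what matters, not the numbers.
toy = pd.DataFrame({"Income": [52000.0, np.nan, 61000.0, 48000.0],
                    "Recency": [10, 20, 30, 40]})

# Quantify the share of affected rows before deciding to drop them.
missing_frac = toy["Income"].isnull().mean()

# Dropping on the affected column only avoids discarding rows that are
# missing values elsewhere.
toy_clean = toy.dropna(subset=["Income"])
print(f"{missing_frac:.1%} missing; {len(toy_clean)} rows kept")  # 25.0% missing; 3 rows kept
```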
# Before modifying the data, create a deep copy, df_copy
df_copy = df.copy(deep=True)
#Remove these rows (Income is already float, so no cast is needed)
df_copy.dropna(inplace = True)
df_copy.info()
<class 'pandas.core.frame.DataFrame'> Index: 2216 entries, 0 to 2239 Data columns (total 27 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 ID 2216 non-null int64 1 Year_Birth 2216 non-null int64 2 Education 2216 non-null object 3 Marital_Status 2216 non-null object 4 Income 2216 non-null float64 5 Kidhome 2216 non-null int64 6 Teenhome 2216 non-null int64 7 Dt_Customer 2216 non-null object 8 Recency 2216 non-null int64 9 MntWines 2216 non-null int64 10 MntFruits 2216 non-null int64 11 MntMeatProducts 2216 non-null int64 12 MntFishProducts 2216 non-null int64 13 MntSweetProducts 2216 non-null int64 14 MntGoldProds 2216 non-null int64 15 NumDealsPurchases 2216 non-null int64 16 NumWebPurchases 2216 non-null int64 17 NumCatalogPurchases 2216 non-null int64 18 NumStorePurchases 2216 non-null int64 19 NumWebVisitsMonth 2216 non-null int64 20 AcceptedCmp3 2216 non-null int64 21 AcceptedCmp4 2216 non-null int64 22 AcceptedCmp5 2216 non-null int64 23 AcceptedCmp1 2216 non-null int64 24 AcceptedCmp2 2216 non-null int64 25 Complain 2216 non-null int64 26 Response 2216 non-null int64 dtypes: float64(1), int64(23), object(3) memory usage: 484.8+ KB
# Change Dt_Customer from object to date time type
df_copy['Dt_Customer'] = pd.to_datetime(df_copy['Dt_Customer'], dayfirst= True)
df_copy['Dt_Customer'].head()
0 2012-09-04 1 2014-03-08 2 2013-08-21 3 2014-02-10 4 2014-01-19 Name: Dt_Customer, dtype: datetime64[ns]
Dt_Customer is now in datetime format.
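Since the note above assumes the data were collected in 2016, a datetime Dt_Customer could later be turned into a tenure feature. A minimal sketch with toy dates (the reference date 2016-01-01 is an assumption, not from the source):

```python
import pandas as pd

# Toy enrollment dates (hypothetical); the real Dt_Customer spans 2012-2014.
dt = pd.to_datetime(pd.Series(["04-09-2012", "08-03-2014"]), dayfirst=True)

# Per the note that the data were collected in 2016, tenure can be
# measured against an assumed reference date (2016-01-01 is a guess).
ref = pd.Timestamp("2016-01-01")
tenure_days = (ref - dt).dt.days
print(tenure_days.tolist())
```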
#Removing columns that are not useful for the analysis
df_copy.drop(columns = 'ID', axis = 1, inplace = True)
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 26 columns
#Looking for duplicates
df_copy[df_copy.duplicated()]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 83 | 1963 | Master | Together | 38620.0 | 0 | 0 | 2013-05-11 | 56 | 112 | 17 | ... | 5 | 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 179 | 1951 | 2n Cycle | Married | 78497.0 | 0 | 0 | 2013-12-01 | 44 | 207 | 26 | ... | 7 | 12 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 281 | 1976 | Graduation | Together | 51369.0 | 0 | 1 | 2012-10-25 | 84 | 297 | 7 | ... | 2 | 4 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 282 | 1946 | Graduation | Together | 37760.0 | 0 | 0 | 2012-08-31 | 20 | 84 | 5 | ... | 1 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 363 | 1978 | PhD | Married | 37717.0 | 1 | 0 | 2012-11-23 | 31 | 9 | 0 | ... | 0 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2218 | 1978 | Graduation | Together | 22775.0 | 1 | 0 | 2013-06-19 | 40 | 5 | 1 | ... | 0 | 2 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2221 | 1982 | Master | Single | 75777.0 | 0 | 0 | 2013-07-04 | 12 | 712 | 26 | ... | 6 | 11 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |
| 2225 | 1968 | Graduation | Together | 58554.0 | 1 | 1 | 2012-09-26 | 55 | 368 | 24 | ... | 2 | 6 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2234 | 1974 | Graduation | Married | 34421.0 | 1 | 0 | 2013-07-01 | 81 | 3 | 3 | ... | 0 | 2 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2236 | 1946 | PhD | Together | 64014.0 | 2 | 1 | 2014-06-10 | 56 | 406 | 0 | ... | 2 | 5 | 7 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
182 rows × 26 columns
There are 182 duplicated rows; they need to be removed.
# Removing duplicated rows
df_copy.drop_duplicates(keep='first', inplace=True, ignore_index=False)
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635 | 88 | ... | 10 | 4 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11 | 1 | ... | 1 | 2 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426 | 49 | ... | 2 | 10 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11 | 4 | ... | 0 | 4 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173 | 43 | ... | 3 | 6 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 26 columns
#look at the final shape and data type before the analysis
df_copy.info()
<class 'pandas.core.frame.DataFrame'> Index: 2034 entries, 0 to 2239 Data columns (total 26 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2034 non-null int64 1 Education 2034 non-null object 2 Marital_Status 2034 non-null object 3 Income 2034 non-null float64 4 Kidhome 2034 non-null int64 5 Teenhome 2034 non-null int64 6 Dt_Customer 2034 non-null datetime64[ns] 7 Recency 2034 non-null int64 8 MntWines 2034 non-null int64 9 MntFruits 2034 non-null int64 10 MntMeatProducts 2034 non-null int64 11 MntFishProducts 2034 non-null int64 12 MntSweetProducts 2034 non-null int64 13 MntGoldProds 2034 non-null int64 14 NumDealsPurchases 2034 non-null int64 15 NumWebPurchases 2034 non-null int64 16 NumCatalogPurchases 2034 non-null int64 17 NumStorePurchases 2034 non-null int64 18 NumWebVisitsMonth 2034 non-null int64 19 AcceptedCmp3 2034 non-null int64 20 AcceptedCmp4 2034 non-null int64 21 AcceptedCmp5 2034 non-null int64 22 AcceptedCmp1 2034 non-null int64 23 AcceptedCmp2 2034 non-null int64 24 Complain 2034 non-null int64 25 Response 2034 non-null int64 dtypes: datetime64[ns](1), float64(1), int64(22), object(2) memory usage: 429.0+ KB
There are 2034 rows and 26 columns. The data types are 22 integer columns, one float (Income), one datetime (Dt_Customer), and two objects (Marital_Status and Education).
There are no more missing values or duplicated rows.
#Summary statistics of the numerical data
df_copy.describe().T
| count | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|
| Year_Birth | 2034.0 | 1968.802852 | 1893.0 | 1959.0 | 1970.0 | 1977.0 | 1996.0 | 11.97542 |
| Income | 2034.0 | 52357.791544 | 1730.0 | 35528.25 | 51533.0 | 68480.75 | 666666.0 | 25526.956988 |
| Kidhome | 2034.0 | 0.44297 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.535914 |
| Teenhome | 2034.0 | 0.508358 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.54613 |
| Dt_Customer | 2034 | 2013-07-11 14:44:57.345132544 | 2012-07-30 00:00:00 | 2013-01-17 00:00:00 | 2013-07-12 12:00:00 | 2014-01-02 00:00:00 | 2014-06-29 00:00:00 | NaN |
| Recency | 2034.0 | 48.844641 | 0.0 | 24.0 | 49.0 | 74.0 | 99.0 | 28.983678 |
| MntWines | 2034.0 | 305.180924 | 0.0 | 23.0 | 175.5 | 505.0 | 1493.0 | 337.323274 |
| MntFruits | 2034.0 | 26.292527 | 0.0 | 2.0 | 8.0 | 33.0 | 199.0 | 39.772885 |
| MntMeatProducts | 2034.0 | 167.76352 | 0.0 | 16.0 | 68.0 | 230.0 | 1725.0 | 226.790697 |
| MntFishProducts | 2034.0 | 37.449361 | 0.0 | 3.0 | 12.0 | 50.0 | 259.0 | 54.792015 |
| MntSweetProducts | 2034.0 | 27.141101 | 0.0 | 1.0 | 8.0 | 33.75 | 262.0 | 41.49443 |
| MntGoldProds | 2034.0 | 43.809735 | 0.0 | 9.0 | 24.0 | 56.0 | 321.0 | 51.741969 |
| NumDealsPurchases | 2034.0 | 2.337758 | 0.0 | 1.0 | 2.0 | 3.0 | 15.0 | 1.933849 |
| NumWebPurchases | 2034.0 | 4.099312 | 0.0 | 2.0 | 4.0 | 6.0 | 27.0 | 2.756911 |
| NumCatalogPurchases | 2034.0 | 2.6706 | 0.0 | 0.0 | 2.0 | 4.0 | 28.0 | 2.937896 |
| NumStorePurchases | 2034.0 | 5.782203 | 0.0 | 3.0 | 5.0 | 8.0 | 13.0 | 3.238853 |
| NumWebVisitsMonth | 2034.0 | 5.322517 | 0.0 | 3.0 | 6.0 | 7.0 | 20.0 | 2.438665 |
| AcceptedCmp3 | 2034.0 | 0.074238 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.262222 |
| AcceptedCmp4 | 2034.0 | 0.076205 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.26539 |
| AcceptedCmp5 | 2034.0 | 0.072271 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.259 |
| AcceptedCmp1 | 2034.0 | 0.065388 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.247271 |
| AcceptedCmp2 | 2034.0 | 0.012291 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.110209 |
| Complain | 2034.0 | 0.009833 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.098696 |
| Response | 2034.0 | 0.153392 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.360454 |
Year_Birth: there are some suspicious values, such as a minimum birth year of 1893; such outliers will have to be examined on boxplots. Birth years range from 1893 to 1996, with a mean around 1969.
Income: income ranges from 1730 to 666666, with a mean of 52357.8, which is close to the median (51533). However, a maximum of 666666 against a standard deviation of 25527 suggests some very high incomes.
Kidhome and Teenhome: the number of kids at home ranges from 0 to 2 (mean 0.44) and the number of teenagers at home ranges from 0 to 2 (mean 0.51); households have at most 2 kids/teens at home.
Dt_Customer: most people joined around 07/2013; the most recent enrollment is 06/2014 and the oldest 07/2012.
Recency: the number of days since the last purchase ranges from 0 to 99, with both mean and median around 49 days.
For all products, there are big variations in the amounts spent, with standard deviations higher than the medians:
MntWines: the amount spent on wine ranges from 0 to 1493, with a mean of 305.18.
MntFruits: the amount spent on fruits ranges from 0 to 199, with a mean of 26.29.
MntMeatProducts: the amount spent on meat products ranges from 0 to 1725, with a mean of 167.76.
MntFishProducts: the amount spent on fish products ranges from 0 to 259, with a mean of 37.45.
MntSweetProducts: the amount spent on sweet products ranges from 0 to 262, with a mean of 27.14.
MntGoldProds: the amount spent on gold products ranges from 0 to 321, with a mean of 43.81.
Purchase behavior:
NumDealsPurchases: the number of deal purchases ranges from 0 to 15, with a mean of 2.34.
NumWebPurchases: the number of web purchases ranges from 0 to 27, with a mean of 4.10.
NumCatalogPurchases: the number of catalog purchases ranges from 0 to 28, with a mean of 2.67.
NumStorePurchases: the number of store purchases ranges from 0 to 13, with a mean of 5.78.
NumWebVisitsMonth: the number of web visits per month ranges from 0 to 20, with a mean of 5.32.
Complain: values are 0 or 1, with a mean of 0.01; most people did not file a complaint.
Accepted offers: these behave as categorical (binary) variables.
AcceptedCmp3: values are 0 or 1, with a mean of 0.07.
AcceptedCmp4: values are 0 or 1, with a mean of 0.08.
AcceptedCmp5: values are 0 or 1, with a mean of 0.07.
AcceptedCmp1: values are 0 or 1, with a mean of 0.07.
AcceptedCmp2: values are 0 or 1, with a mean of 0.01.
Response: values are 0 or 1, with a mean of 0.15.
offers= ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
for col in offers:
    print('Number of {} is {}'.format(col, df_copy[col].sum()))
Number of AcceptedCmp1 is 133 Number of AcceptedCmp2 is 25 Number of AcceptedCmp3 is 151 Number of AcceptedCmp4 is 155 Number of AcceptedCmp5 is 147 Number of Response is 312
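The per-column loop above can also be expressed as a single vectorized call. A sketch on toy 0/1 flags (hypothetical values, not the real counts):

```python
import pandas as pd

# Toy stand-ins for the campaign flags (hypothetical 0/1 values).
df_toy = pd.DataFrame({"AcceptedCmp1": [1, 0, 0, 1],
                       "AcceptedCmp2": [0, 0, 0, 0],
                       "Response":     [1, 1, 0, 0]})

# One vectorized call replaces the per-column loop.
totals = df_toy[["AcceptedCmp1", "AcceptedCmp2", "Response"]].sum()
print(totals)  # AcceptedCmp1: 2, AcceptedCmp2: 0, Response: 2
```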
# Statistical summary of categorical data
df_copy.describe(include=[object]).T
| count | unique | top | freq | |
|---|---|---|---|---|
| Education | 2034 | 5 | Graduation | 1019 |
| Marital_Status | 2034 | 8 | Married | 788 |
There are 5 levels of education; the most frequent is Graduation.
There are 8 categories for Marital_Status, which seems high; the most frequent is Married.
#Unique values in categorical data: education
df_copy.Education.unique()
array(['Graduation', 'PhD', 'Master', 'Basic', '2n Cycle'], dtype=object)
# Counting how many people are in each category of Education
df_copy.Education.value_counts()
Education Graduation 1019 PhD 445 Master 336 2n Cycle 185 Basic 49 Name: count, dtype: int64
'2n Cycle' also refers to a Master's degree in Europe, so '2n Cycle' will be replaced by 'Master'.
#Replace 2n Cycle by Master
df_copy['Education'] = df_copy['Education'].replace(to_replace = '2n Cycle', value = 'Master')
df_copy.Education.value_counts()
Education Graduation 1019 Master 521 PhD 445 Basic 49 Name: count, dtype: int64
2n Cycle has now disappeared, and Master now has 336 + 185 = 521 entries.
#Unique values in categorical data: Marital Status
df_copy['Marital_Status'].unique()
array(['Single', 'Together', 'Married', 'Divorced', 'Widow', 'Alone',
'Absurd', 'YOLO'], dtype=object)
'Absurd' and 'YOLO' do not answer the question and should be removed.
#Count how many people are in each category
df_copy['Marital_Status'].value_counts()
Marital_Status Married 788 Together 514 Single 439 Divorced 216 Widow 70 Alone 3 Absurd 2 YOLO 2 Name: count, dtype: int64
#removing Absurd and Yolo from Marital_Status
remove_list = ['Absurd', 'YOLO']
df_copy = df_copy[~df_copy['Marital_Status'].isin(remove_list)]
df_copy['Marital_Status'].value_counts()
Marital_Status Married 788 Together 514 Single 439 Divorced 216 Widow 70 Alone 3 Name: count, dtype: int64
'Absurd' and 'YOLO' have been removed from the dataset, as they are not proper answers. 'Single' and 'Alone' mean the same thing, so 'Alone' will be merged into 'Single'.
#Replace 'Alone' by 'Single'
df_copy['Marital_Status'] = df_copy['Marital_Status'].replace(to_replace = 'Alone', value = 'Single')
df_copy['Marital_Status'].value_counts()
Marital_Status Married 788 Together 514 Single 442 Divorced 216 Widow 70 Name: count, dtype: int64
#Unique values in categorical data: Kidhome
df_copy['Kidhome'].value_counts()
Kidhome 0 1170 1 819 2 41 Name: count, dtype: int64
Most people do not have a kid at home, which is consistent with the age of the population; 819 have 1 child at home and 41 have 2.
#Unique values in categorical data: Teenhome
df_copy['Teenhome'].value_counts()
Teenhome 0 1047 1 934 2 49 Name: count, dtype: int64
Most people do not have a teen at home. Slightly more households have a teen than a kid, which is also consistent with the age of the population.
#Unique values in categorical data: Complain
df_copy['Complain'].value_counts()
Complain 0 2010 1 20 Name: count, dtype: int64
Only 20 people have filed a complaint.
Univariate analysis explores each variable in a dataset separately. It looks at the range of values as well as their central tendency, and can be done for both numerical and categorical variables.
Explore and visualise numerical variables
#List of columns with numerical values
df_copy.columns
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response'],
dtype='object')
# work on a new dataframe df_n for numerical values for univariate analysis
df_n = df_copy.copy(deep = True)
#Remove categorical variables
df_n.drop(columns = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'Complain', 'Response', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Dt_Customer'], inplace = True)
df_n.head()
| Year_Birth | Income | Recency | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | 58138.0 | 58 | 635 | 88 | 546 | 172 | 88 | 88 | 3 | 8 | 10 | 4 | 7 |
| 1 | 1954 | 46344.0 | 38 | 11 | 1 | 6 | 2 | 1 | 6 | 2 | 1 | 1 | 2 | 5 |
| 2 | 1965 | 71613.0 | 26 | 426 | 49 | 127 | 111 | 21 | 42 | 1 | 8 | 2 | 10 | 4 |
| 3 | 1984 | 26646.0 | 26 | 11 | 4 | 20 | 10 | 3 | 5 | 2 | 2 | 0 | 4 | 6 |
| 4 | 1981 | 58293.0 | 94 | 173 | 43 | 118 | 46 | 27 | 15 | 5 | 5 | 3 | 6 | 5 |
#Univariate analysis for numerical values
#Creating a histogram and a box plot for each variable
for col in df_n.columns:
    plt.subplots(1, 2, figsize = (15, 4))
    plt.subplot(1, 2, 1)
    print(col)
    print('Skew :', round(df_n[col].skew(), 2))
    sns.set_style("whitegrid")
    sns.histplot(x = col, data = df_n, kde = True)
    plt.axvline(x = df_n[col].mean(), color = 'g', linestyle = '--')
    plt.title(col)
    plt.subplot(1, 2, 2)
    sns.boxplot(x = col, data = df_n)
    plt.title(col)
    plt.show()
Year_Birth Skew : -0.38
Income Skew : 7.06
Recency Skew : 0.01
MntWines Skew : 1.17
MntFruits Skew : 2.11
MntMeatProducts Skew : 2.05
MntFishProducts Skew : 1.93
MntSweetProducts Skew : 2.12
MntGoldProds Skew : 1.85
NumDealsPurchases Skew : 2.44
NumWebPurchases Skew : 1.23
NumCatalogPurchases Skew : 1.96
NumStorePurchases Skew : 0.7
NumWebVisitsMonth Skew : 0.27
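The per-column skew printed inside the loop can also be obtained in one vectorized call. A sketch on a toy frame (made-up values):

```python
import pandas as pd

# Toy numeric frame (made-up values) to illustrate the one-liner.
toy = pd.DataFrame({"a": [1, 2, 3, 4, 100],      # strongly right-skewed
                    "b": [10, 11, 12, 13, 14]})  # symmetric

# Skewness of every column at once, highest first.
skews = toy.skew().round(2)
print(skews.sort_values(ascending=False))
```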
Year_Birth: there are some outliers born before 1900, which is suspicious. The distribution is bimodal, with one mode around 1955 and another around 1975.
Income: lots of outliers; otherwise roughly normal.
Recency: uniform distribution; mean and median are both around 49 days.
All product amounts: right-skewed distributions with many outliers at the high end.
All purchase types: right-skewed with outliers, except store purchases, which are only mildly skewed.
Number of web visits per month: close to symmetric, with a few outliers.
#Cap outliers in Income at Q3 + 1.5*IQR (upper-tail winsorizing rather than row removal)
Q1 = df_n['Income'].quantile(0.25)
Q3 = df_n['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
df_n['Income'] = np.where(df_n['Income'] > upper_bound, upper_bound, df_n['Income'])
sns.boxplot(df_n['Income'])
plt.show()
#Cap outliers in the product-amount and purchase-count columns at Q3 + 1.5*IQR
cols_to_cap = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
               'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']
for item in cols_to_cap:
    Q1 = df_n[item].quantile(0.25)
    Q3 = df_n[item].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    df_n[item] = np.where(df_n[item] > upper_bound, upper_bound, df_n[item])
    plt.figure(figsize = (5, 4))
    sns.boxplot(df_n[item])
    plt.show()
#Remove rows where the year of birth is 1900 or earlier
condition = df_n['Year_Birth'] <= 1900
df_n = df_n.drop(df_n[condition].index)
# Apply the same capping rules to df_copy, computing the bounds from df_copy itself
Q1 = df_copy['Income'].quantile(0.25)
Q3 = df_copy['Income'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
df_copy['Income'] = np.where(df_copy['Income'] > upper_bound, upper_bound, df_copy['Income'])
cols_to_cap = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
               'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumWebVisitsMonth']
for item in cols_to_cap:
    Q1 = df_copy[item].quantile(0.25)
    Q3 = df_copy[item].quantile(0.75)
    IQR = Q3 - Q1
    upper_bound = Q3 + 1.5 * IQR
    df_copy[item] = np.where(df_copy[item] > upper_bound, upper_bound, df_copy[item])
#Remove rows where the year of birth is 1900 or earlier
condition = df_copy['Year_Birth'] <= 1900
df_copy = df_copy.drop(df_copy[condition].index)
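The same Q1/Q3/IQR computation appears several times above; it could be centralized in a small helper. A sketch on toy data (the helper name `cap_upper_iqr` is hypothetical, not from the source):

```python
import pandas as pd

def cap_upper_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values above Q3 + k*IQR at that bound (upper-side winsorizing)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    upper = q3 + k * (q3 - q1)
    return s.clip(upper=upper)

# Toy series with one extreme value (hypothetical data): the 100 is
# pulled down to the IQR bound (7 here), the rest are untouched.
capped = cap_upper_iqr(pd.Series([1, 2, 3, 4, 100]))
print(capped.max())  # caps 100 down to 7
```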
Explore and visualize categorical variables
#Separate univariate analysis for cmp1, cmp2, cmp3, cmp4, cmp5 and Response.
df_cmp = df_copy[['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']]
df_cmp.head()
| AcceptedCmp1 | AcceptedCmp2 | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | Response | |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 |
df_cmp.sum()
AcceptedCmp1 132 AcceptedCmp2 25 AcceptedCmp3 151 AcceptedCmp4 155 AcceptedCmp5 145 Response 310 dtype: int64
# Create a dataframe for visualization
df_cmp_tot = pd.DataFrame({'cmp':['AcceptedCmp1', 'AcceptedCmp2','AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response'], 'values': [132, 25, 151, 155, 145, 310]})
df_cmp_tot
| cmp | values | |
|---|---|---|
| 0 | AcceptedCmp1 | 132 |
| 1 | AcceptedCmp2 | 25 |
| 2 | AcceptedCmp3 | 151 |
| 3 | AcceptedCmp4 | 155 |
| 4 | AcceptedCmp5 | 145 |
| 5 | Response | 310 |
#create graph
print('AcceptedCmp 1, 2, 3, 4, 5 and last')
sns.barplot(x = 'cmp', y = 'values', data = df_cmp_tot, palette = 'Set1', hue = 'cmp')
plt.xticks(rotation = 60)
plt.show()
AcceptedCmp 1, 2, 3, 4, 5 and last
Most people accepted the last offer (Response); very few accepted the second campaign.
#Explore complains
df_copy['Complain'].value_counts()
Complain 0 2008 1 19 Name: count, dtype: int64
There have been 19 complaints, under 1% of customers, so this column will probably be dropped for the analysis.
# Explore Education
df_copy['Education'].value_counts()
Education Graduation 1018 Master 518 PhD 442 Basic 49 Name: count, dtype: int64
# Explore Education: graph
order = ['Basic', 'Graduation', 'Master', 'PhD']
sns.countplot(df_copy['Education'], palette = 'Set1', order = order)
<Axes: xlabel='count', ylabel='Education'>
#Explore Marital status
sns.countplot(df_copy['Marital_Status'],palette = 'Set1')
<Axes: xlabel='count', ylabel='Marital_Status'>
#Explore kidhome
df_copy['Kidhome'].value_counts()
Kidhome 0 1168 1 818 2 41 Name: count, dtype: int64
sns.violinplot(df_copy['Kidhome'])
<Axes: ylabel='Kidhome'>
This graph shows the spread of people having 0, 1, or 2 kids at home.
#Explore Teenhome
df_copy['Teenhome'].value_counts()
Teenhome 0 1045 1 933 2 49 Name: count, dtype: int64
sns.violinplot(df_copy['Teenhome'])
<Axes: ylabel='Teenhome'>
This graph shows the spread of people having 0, 1, or 2 teens at home.
# Explore Dt_Customer (already converted to datetime earlier)
# Extract the year and month for plotting
df_copy['Year_Month'] = df_copy['Dt_Customer'].dt.to_period('M')
# Plot the count of customers by 'Dt_Customer'
plt.figure(figsize=(12, 4))
sns.countplot(x='Year_Month', data=df_copy, palette = 'Set1')
plt.xticks(rotation=90)
plt.title('Count of Customers by Dt_Customer')
plt.xlabel('Year-Month')
plt.ylabel('Count')
plt.show()
Bivariate analysis between numerical variables
sns.pairplot(df_n, corner= True)
#Multivariate analysis
# correlation between all variables: bivariate analysis for numerical values
fig = plt.figure(figsize=(17,5))
sns.heatmap(df_copy[['Year_Birth', 'Income', 'Kidhome',
'Teenhome', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp3',
'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2',
'Complain', 'Response']].corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1, annot_kws={'size': 8})
plt.xticks(rotation = 60)
plt.title('Heatmap for numerical values')
plt.show()
#Multivariate analysis
# correlation between all variables: bivariate analysis for numerical values, removing categorical variables, to focus on significant data
fig = plt.figure(figsize=(17,5))
sns.heatmap(df_n.corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(rotation = 60)
plt.show()
Significant correlations:
Income is positively correlated with all product purchase amounts and with web, catalog, and store purchases. It is negatively correlated with the number of web visits per month and with deals purchases: higher-income customers seem to buy more with fewer online visits, so when they visit, they probably buy. Income is also negatively correlated with year of birth: younger people earn less.
Recency is not correlated with any other variable.
All product purchase amounts are positively correlated with income, with each other, and with web, catalog, and store purchases, and negatively correlated with the number of web visits per month and year of birth. Meat and wine purchases are particularly highly correlated.
Deals purchases are positively correlated with the number of web visits per month and with web purchases, negatively correlated with income, and not correlated with any specific product.
Web purchases are positively correlated with store and catalog purchases, with deals purchases, and with all types of product purchases. They are positively correlated with income and negatively correlated with year of birth (older customers buy more via the web).
Catalog purchases are positively correlated with store and web purchases, with deals purchases, and with all types of product purchases. They are positively correlated with income and negatively correlated with year of birth and the number of web visits.
Store purchases show the same profile as catalog purchases.
The number of web visits per month is negatively correlated with store and catalog purchases, income, and all product purchase amounts, and positively correlated with deals purchases and year of birth: younger people visit the web more often.
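Rather than reading the heatmap cell by cell, the strongest pairs can be extracted programmatically from the correlation matrix. A minimal sketch on a small synthetic frame (the real `df_n` has different columns):

```python
import numpy as np
import pandas as pd

# Small synthetic frame standing in for df_n (hypothetical values)
rng = np.random.default_rng(0)
income = rng.normal(50_000, 10_000, 200)
df_n = pd.DataFrame({
    "Income": income,
    "MntWines": income * 0.01 + rng.normal(0, 50, 200),            # tracks Income
    "NumWebVisitsMonth": -income * 1e-4 + rng.normal(0, 1, 200),   # anti-tracks Income
    "Recency": rng.normal(50, 20, 200),                            # unrelated
})

corr = df_n.corr()
# Keep each pair once (upper triangle, diagonal excluded), then rank by |r|
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()
top = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(top.head(3).round(2))
```

Applied to the real numeric frame, this lists the strongest (anti-)correlations without scanning the heatmap by eye.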
Bivariate analysis between categorical and numerical variables
#correlation between categorical variables and numerical variables
#List of categorical variables
list_cat = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome','Complain']
#Dataframe for categorical values
df_cat = df_copy[list_cat]
df_cat.head()
| Education | Marital_Status | Kidhome | Teenhome | Complain | |
|---|---|---|---|---|---|
| 0 | Graduation | Single | 0 | 0 | 0 |
| 1 | Graduation | Single | 1 | 1 | 0 |
| 2 | Graduation | Together | 0 | 0 | 0 |
| 3 | Graduation | Together | 1 | 0 | 0 |
| 4 | PhD | Married | 1 | 0 | 0 |
columns_num=['Year_Birth','Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
'MntFishProducts', 'MntSweetProducts', 'MntGoldProds',
'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases',
'NumStorePurchases', 'NumWebVisitsMonth']
columns_cat = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome','Complain','Response', 'AcceptedCmp3',
'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']
for num in columns_num:
fig, ax = plt.subplots(3,5, figsize= (15,10))
fig.suptitle('Correlation between {} and categorical variables'.format(num))
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1, wspace=0.4, hspace=0.5)
for i, cat in enumerate (columns_cat):
plt.subplot(3,5,i+1)
sns.boxplot(x=df_copy[cat], y= df_copy[num], data = df_copy, palette = 'Set1', hue =df_copy[cat], legend = False )
plt.xticks(rotation=60, fontsize = 8)
plt.show()
A higher level of education is associated with older customers and higher income; no other numerical variable shows a clear pattern.
Marital status is not associated with any numerical variable.
No kids at home is associated with higher income, more purchases of all products, more catalog, store, and web purchases, and fewer web visits.
Teenhome: younger customers have no teens at home, while older customers have 1 or 2 (possibly grandchildren). More teens at home is associated with more deals purchases.
No association for complaints.
Accepting the 1st offer is associated with higher income, more purchases of all product types, more web, catalog, and store purchases, fewer deals purchases, and fewer web visits.
The 5th offer shows a similar profile; the other offers are less clearly associated with the numerical variables.
#Association between categorical variables
#The most interesting is between campaign acceptance and the other categorical variables
col1 = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']
cmps= ['Response', 'AcceptedCmp3','AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1', 'AcceptedCmp2']
for i,col in enumerate(col1):
for cmp in cmps:
#Create a crosstab to count occurrences
crosstab = pd.crosstab(df_copy[cmp], df_copy[col])
# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(crosstab)
# Print the p-value
print('The p-value is', p)
# Plot the heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(crosstab, annot=True, cmap='coolwarm', fmt='d')
plt.title('Heatmap of offers vs {}'.format(col))
plt.show()
The p-value is 0.0006882643736489826
The p-value is 0.7929217934720838
The p-value is 0.07044817148158672
The p-value is 0.22533866855409893
The p-value is 0.18419305464215485
The p-value is 0.28313668071216286
The p-value is 1.7710435205233267e-09
The p-value is 0.5981493284073702
The p-value is 0.2602413936046153
The p-value is 0.7251850536413036
The p-value is 0.5740442729004149
The p-value is 0.27748848752404426
The p-value is 0.0012181729836805268
The p-value is 0.26064149541943094
The p-value is 3.2836532900200857e-13
The p-value is 4.845384418252758e-21
The p-value is 2.5390823754361973e-17
The p-value is 0.0004827227439292986
The p-value is 9.966066901065126e-13
The p-value is 0.07164104547941785
The p-value is 0.1288177490081988
The p-value is 1.1512583969875174e-18
The p-value is 7.033087178903188e-11
The p-value is 0.34905093015180316
Based on the p-values above (significance at 0.05), the response to the last offer is associated with Education, Marital_Status, Kidhome, and Teenhome.
The 1st and 5th offers are associated with Kidhome and Teenhome.
The 2nd and 4th offers are associated with Kidhome.
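The bare stream of printed p-values above is hard to map back to its (campaign, variable) pair. A hedged alternative sketch collects them into a labelled DataFrame instead, shown here on a tiny synthetic stand-in for `df_copy`:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Tiny synthetic stand-in for df_copy (hypothetical values, far fewer levels)
df_demo = pd.DataFrame({
    "Kidhome":   [0, 0, 1, 1, 0, 1, 0, 1] * 25,
    "Response":  [1, 0, 0, 0, 1, 0, 1, 0] * 25,
    "Education": ["Basic", "PhD"] * 100,
})

rows = []
for col in ["Kidhome", "Education"]:   # profiling variables
    for cmp in ["Response"]:           # campaign columns
        crosstab = pd.crosstab(df_demo[cmp], df_demo[col])
        chi2, p, dof, _ = chi2_contingency(crosstab)
        rows.append({"variable": col, "campaign": cmp, "p_value": p})

pvals = pd.DataFrame(rows).sort_values("p_value")
print(pvals)
```

With the full `col1` and `cmps` lists, the resulting table makes each significant pair immediately visible.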
#Also look at complain and other categorical values
col1 = ['Education', 'Marital_Status', 'Kidhome', 'Teenhome']
for i,col in enumerate(col1):
#Create a crosstab to count occurrences
crosstab = pd.crosstab(df_cat['Complain'],df_cat[col])
# Perform Chi-Square Test of Independence
chi2, p, dof, expected = chi2_contingency(crosstab)
# Print the p-value
print('The p-value is', p)
# Plot the heatmap
plt.figure(figsize=(8, 4))
sns.heatmap(crosstab, annot=True, cmap='coolwarm', fmt='d')
plt.title('Heatmap of complain vs {}'.format(col))
plt.show()
The p-value is 0.24675241871574988
The p-value is 0.7252099715190983
The p-value is 0.2845912205020152
The p-value is 0.7008190509168353
No association between complaints and the other categorical variables.
In this section, we will first prepare our dataset for analysis.
Think About It:
# Create a new feature with the age of each customer (as of 2024, when this notebook was run)
df_copy['Customer_age'] = 2024 - df_copy['Year_Birth']
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | NumWebVisitsMonth | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year_Month | Customer_age | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 7.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2012-09 | 67 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 5.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-03 | 70 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 4.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013-08 | 59 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 6.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-02 | 40 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 5.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-01 | 43 |
5 rows × 28 columns
#total kids and teens in the home
df_copy['Total_kids'] = df_copy['Kidhome'] + df_copy['Teenhome']
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | AcceptedCmp3 | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year_Month | Customer_age | Total_kids | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2012-09 | 67 | 0 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-03 | 70 | 2 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2013-08 | 59 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-02 | 40 | 1 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2014-01 | 43 | 1 |
5 rows × 29 columns
#Total number of family members in each household
#Number of adults at home: 2 if Married or Together, otherwise 1
df_copy['N-Parents'] = np.where(df_copy['Marital_Status'].isin(['Married', 'Together']), 2, 1)
df_copy['Family_Size'] = df_copy['Total_kids'] + df_copy['N-Parents']
df_copy.drop(['N-Parents'], axis = 1, inplace = True)
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | AcceptedCmp4 | AcceptedCmp5 | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 2012-09 | 67 | 0 | 1 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-03 | 70 | 2 | 3 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2013-08 | 59 | 0 | 2 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-02 | 40 | 1 | 3 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 2014-01 | 43 | 1 | 3 |
5 rows × 30 columns
# How long has the customer been with the company? Compute tenure in years since enrolment
df_copy['Dt_Customer'] = pd.to_datetime(df_copy['Dt_Customer'], dayfirst= True)
now = pd.Timestamp('now')
df_copy['Time_since_enrol'] = (now - df_copy['Dt_Customer']).dt.days/365.25
df_copy['Time_since_enrol'] = np.round(df_copy['Time_since_enrol'],2)
df_copy['Time_since_enrol'].head()
0 11.75 1 10.25 2 10.79 3 10.32 4 10.38 Name: Time_since_enrol, dtype: float64
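A reproducibility note: `pd.Timestamp('now')` makes `Time_since_enrol` change every time the notebook is re-run. Since the dataset note assumes collection in 2016, an alternative sketch anchors tenure to a fixed reference date (the exact date chosen here is an assumption):

```python
import pandas as pd

# Hypothetical enrolment dates; the real column is df_copy['Dt_Customer']
dt = pd.to_datetime(pd.Series(["04-09-2012", "08-03-2014"]), dayfirst=True)

# Fixed anchor (assumed end of the stated 2016 collection year)
reference = pd.Timestamp("2016-12-31")
tenure_years = ((reference - dt).dt.days / 365.25).round(2)
print(tenure_years.tolist())  # [4.32, 2.82]
```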
# Total amount spent by the customer across product categories
df_copy['Total_spendings'] = (df_copy['MntFishProducts'] + df_copy['MntFruits'] + df_copy['MntGoldProds']
                              + df_copy['MntMeatProducts'] + df_copy['MntSweetProducts'] + df_copy['MntWines'])
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | AcceptedCmp1 | AcceptedCmp2 | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | Time_since_enrol | Total_spendings | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 0 | 0 | 1 | 2012-09 | 67 | 0 | 1 | 11.75 | 1560.375 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 0 | 0 | 2014-03 | 70 | 2 | 3 | 10.25 | 27.000 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 0 | 0 | 2013-08 | 59 | 0 | 2 | 10.79 | 776.000 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 0 | 0 | 2014-02 | 40 | 1 | 3 | 10.32 | 53.000 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 0 | 0 | 2014-01 | 43 | 1 | 3 | 10.38 | 422.000 |
5 rows × 32 columns
#how many offers the customers have accepted
df_copy['Total_offers'] = df_copy['AcceptedCmp1'] +df_copy['AcceptedCmp2'] + df_copy['AcceptedCmp3'] + df_copy['AcceptedCmp4'] + df_copy['AcceptedCmp5'] + df_copy['Response']
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | AcceptedCmp2 | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | Time_since_enrol | Total_spendings | Total_offers | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 0 | 1 | 2012-09 | 67 | 0 | 1 | 11.75 | 1560.375 | 1 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 0 | 2014-03 | 70 | 2 | 3 | 10.25 | 27.000 | 0 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 0 | 2013-08 | 59 | 0 | 2 | 10.79 | 776.000 | 0 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 0 | 2014-02 | 40 | 1 | 3 | 10.32 | 53.000 | 0 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 0 | 2014-01 | 43 | 1 | 3 | 10.38 | 422.000 | 0 |
5 rows × 33 columns
#amount spent per purchase
df_copy['Total_n_purchases'] = df_copy['NumCatalogPurchases'] + df_copy['NumWebPurchases'] + df_copy['NumStorePurchases'] + df_copy['NumDealsPurchases']
df_copy['Amount_per_purchase'] = df_copy['Total_spendings'] / df_copy['Total_n_purchases']
df_copy['Amount_per_purchase'] = np.round(df_copy['Amount_per_purchase'],2)
df_copy.drop('Total_n_purchases', axis = 1, inplace = True)
df_copy.head()
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 1 | 2012-09 | 67 | 0 | 1 | 11.75 | 1560.375 | 1 | 62.42 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 2014-03 | 70 | 2 | 3 | 10.25 | 27.000 | 0 | 4.50 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 2013-08 | 59 | 0 | 2 | 10.79 | 776.000 | 0 | 36.95 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 2014-02 | 40 | 1 | 3 | 10.32 | 53.000 | 0 | 6.62 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 2014-01 | 43 | 1 | 3 | 10.38 | 422.000 | 0 | 22.21 |
5 rows × 34 columns
#looking for missing values
df_copy.info()
<class 'pandas.core.frame.DataFrame'> Index: 2027 entries, 0 to 2239 Data columns (total 34 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2027 non-null int64 1 Education 2027 non-null object 2 Marital_Status 2027 non-null object 3 Income 2027 non-null float64 4 Kidhome 2027 non-null int64 5 Teenhome 2027 non-null int64 6 Dt_Customer 2027 non-null datetime64[ns] 7 Recency 2027 non-null int64 8 MntWines 2027 non-null float64 9 MntFruits 2027 non-null float64 10 MntMeatProducts 2027 non-null float64 11 MntFishProducts 2027 non-null float64 12 MntSweetProducts 2027 non-null float64 13 MntGoldProds 2027 non-null float64 14 NumDealsPurchases 2027 non-null float64 15 NumWebPurchases 2027 non-null float64 16 NumCatalogPurchases 2027 non-null float64 17 NumStorePurchases 2027 non-null int64 18 NumWebVisitsMonth 2027 non-null float64 19 AcceptedCmp3 2027 non-null int64 20 AcceptedCmp4 2027 non-null int64 21 AcceptedCmp5 2027 non-null int64 22 AcceptedCmp1 2027 non-null int64 23 AcceptedCmp2 2027 non-null int64 24 Complain 2027 non-null int64 25 Response 2027 non-null int64 26 Year_Month 2027 non-null period[M] 27 Customer_age 2027 non-null int64 28 Total_kids 2027 non-null int64 29 Family_Size 2027 non-null int64 30 Time_since_enrol 2027 non-null float64 31 Total_spendings 2027 non-null float64 32 Total_offers 2027 non-null int64 33 Amount_per_purchase 2027 non-null float64 dtypes: datetime64[ns](1), float64(14), int64(16), object(2), period[M](1) memory usage: 618.8+ KB
#looking for inf values
df_copy[df_copy['Amount_per_purchase']==np.inf]
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 655 | 1975 | Graduation | Divorced | 118043.5 | 0 | 0 | 2014-02-07 | 81 | 1.0 | 1.0 | ... | 0 | 0 | 2014-02 | 49 | 0 | 1 | 10.32 | 6.0 | 0 | inf |
| 981 | 1965 | Graduation | Divorced | 4861.0 | 0 | 0 | 2014-06-22 | 20 | 2.0 | 1.0 | ... | 0 | 0 | 2014-06 | 59 | 0 | 1 | 9.95 | 6.0 | 0 | inf |
| 1524 | 1973 | Graduation | Single | 3502.0 | 1 | 0 | 2013-04-13 | 56 | 2.0 | 1.0 | ... | 0 | 0 | 2013-04 | 51 | 1 | 2 | 11.15 | 5.0 | 0 | inf |
| 2132 | 1949 | PhD | Married | 118043.5 | 0 | 0 | 2013-08-29 | 85 | 2.0 | 1.0 | ... | 0 | 0 | 2013-08 | 75 | 0 | 2 | 10.77 | 8.0 | 0 | inf |
4 rows × 34 columns
#drop these rows
df_copy = df_copy[~df_copy['Amount_per_purchase'].isin([np.inf, -np.inf])]
df_copy.info()
<class 'pandas.core.frame.DataFrame'> Index: 2023 entries, 0 to 2239 Data columns (total 34 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Year_Birth 2023 non-null int64 1 Education 2023 non-null object 2 Marital_Status 2023 non-null object 3 Income 2023 non-null float64 4 Kidhome 2023 non-null int64 5 Teenhome 2023 non-null int64 6 Dt_Customer 2023 non-null datetime64[ns] 7 Recency 2023 non-null int64 8 MntWines 2023 non-null float64 9 MntFruits 2023 non-null float64 10 MntMeatProducts 2023 non-null float64 11 MntFishProducts 2023 non-null float64 12 MntSweetProducts 2023 non-null float64 13 MntGoldProds 2023 non-null float64 14 NumDealsPurchases 2023 non-null float64 15 NumWebPurchases 2023 non-null float64 16 NumCatalogPurchases 2023 non-null float64 17 NumStorePurchases 2023 non-null int64 18 NumWebVisitsMonth 2023 non-null float64 19 AcceptedCmp3 2023 non-null int64 20 AcceptedCmp4 2023 non-null int64 21 AcceptedCmp5 2023 non-null int64 22 AcceptedCmp1 2023 non-null int64 23 AcceptedCmp2 2023 non-null int64 24 Complain 2023 non-null int64 25 Response 2023 non-null int64 26 Year_Month 2023 non-null period[M] 27 Customer_age 2023 non-null int64 28 Total_kids 2023 non-null int64 29 Family_Size 2023 non-null int64 30 Time_since_enrol 2023 non-null float64 31 Total_spendings 2023 non-null float64 32 Total_offers 2023 non-null int64 33 Amount_per_purchase 2023 non-null float64 dtypes: datetime64[ns](1), float64(14), int64(16), object(2), period[M](1) memory usage: 553.2+ KB
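Instead of creating `inf` values and dropping those rows afterwards, the zero denominator can be guarded at the division itself. A minimal sketch on toy columns (the real frame divides `Total_spendings` by `Total_n_purchases`):

```python
import numpy as np
import pandas as pd

# Toy columns; second customer has zero recorded purchases
df = pd.DataFrame({"Total_spendings": [100.0, 6.0], "Total_n_purchases": [4, 0]})

# A zero denominator becomes NaN, so the ratio is NaN instead of inf
denom = df["Total_n_purchases"].replace(0, np.nan)
df["Amount_per_purchase"] = (df["Total_spendings"] / denom).round(2)
print(df["Amount_per_purchase"].tolist())  # [25.0, nan]
```

Whether to drop such rows or keep them as NaN is then an explicit modelling choice rather than an `inf` clean-up.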
df_copy
| Year_Birth | Education | Marital_Status | Income | Kidhome | Teenhome | Dt_Customer | Recency | MntWines | MntFruits | ... | Complain | Response | Year_Month | Customer_age | Total_kids | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1957 | Graduation | Single | 58138.0 | 0 | 0 | 2012-09-04 | 58 | 635.0 | 79.5 | ... | 0 | 1 | 2012-09 | 67 | 0 | 1 | 11.75 | 1560.375 | 1 | 62.42 |
| 1 | 1954 | Graduation | Single | 46344.0 | 1 | 1 | 2014-03-08 | 38 | 11.0 | 1.0 | ... | 0 | 0 | 2014-03 | 70 | 2 | 3 | 10.25 | 27.000 | 0 | 4.50 |
| 2 | 1965 | Graduation | Together | 71613.0 | 0 | 0 | 2013-08-21 | 26 | 426.0 | 49.0 | ... | 0 | 0 | 2013-08 | 59 | 0 | 2 | 10.79 | 776.000 | 0 | 36.95 |
| 3 | 1984 | Graduation | Together | 26646.0 | 1 | 0 | 2014-02-10 | 26 | 11.0 | 4.0 | ... | 0 | 0 | 2014-02 | 40 | 1 | 3 | 10.32 | 53.000 | 0 | 6.62 |
| 4 | 1981 | PhD | Married | 58293.0 | 1 | 0 | 2014-01-19 | 94 | 173.0 | 43.0 | ... | 0 | 0 | 2014-01 | 43 | 1 | 3 | 10.38 | 422.000 | 0 | 22.21 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2233 | 1977 | Graduation | Together | 118043.5 | 1 | 0 | 2013-06-02 | 23 | 9.0 | 14.0 | ... | 0 | 0 | 2013-06 | 47 | 1 | 3 | 11.01 | 62.000 | 0 | 5.64 |
| 2235 | 1967 | Graduation | Married | 61223.0 | 0 | 1 | 2013-06-13 | 46 | 709.0 | 43.0 | ... | 0 | 0 | 2013-06 | 57 | 1 | 3 | 10.98 | 1185.375 | 0 | 65.85 |
| 2237 | 1981 | Graduation | Divorced | 56981.0 | 0 | 0 | 2014-01-25 | 91 | 908.0 | 48.0 | ... | 0 | 0 | 2014-01 | 43 | 0 | 1 | 10.36 | 1241.000 | 1 | 65.32 |
| 2238 | 1956 | Master | Together | 69245.0 | 0 | 1 | 2014-01-24 | 8 | 428.0 | 30.0 | ... | 0 | 0 | 2014-01 | 68 | 1 | 3 | 10.36 | 843.000 | 0 | 36.65 |
| 2239 | 1954 | PhD | Married | 52869.0 | 1 | 1 | 2012-10-15 | 40 | 84.0 | 3.0 | ... | 0 | 1 | 2012-10 | 70 | 2 | 4 | 11.64 | 172.000 | 1 | 15.64 |
2023 rows × 34 columns
df_copy.columns
Index(['Year_Birth', 'Education', 'Marital_Status', 'Income', 'Kidhome',
'Teenhome', 'Dt_Customer', 'Recency', 'MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'AcceptedCmp1',
'AcceptedCmp2', 'Complain', 'Response', 'Year_Month', 'Customer_age',
'Total_kids', 'Family_Size', 'Time_since_enrol', 'Total_spendings',
'Total_offers', 'Amount_per_purchase'],
dtype='object')
#Multivariate analysis with new variables
# correlation between all variables: bivariate analysis for numerical values, removing categorical variables, to focus on significant data
fig = plt.figure(figsize=(17,5))
sns.heatmap(df_copy[['Income', 'Kidhome',
'Teenhome', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Customer_age', 'Family_Size', 'Time_since_enrol', 'Total_spendings', 'Total_offers',
'Amount_per_purchase']].corr(), annot = True, cmap='coolwarm', vmin=-1, vmax=1)
plt.xticks(rotation = 60)
plt.show()
Time since enrolment is positively correlated with web purchases, deals purchases, and the number of web visits per month.
Insights have been reported above after each graph.
All product purchase amounts are correlated with each other and can be grouped. The most significant offers are Cmp5 and Response, but all offers can be grouped together.
Complain and Recency are not correlated with anything else and can be dropped. Family_Size can replace Kidhome and Teenhome and partially reflects marital status.
Income and age appear related (younger customers earn less), and I suspect they will be the main characteristics that identify clusters.
Among purchase channels, store, catalog, and web purchases behave similarly on one side, with deals purchases on the other; the two groups appear inversely correlated.
Higher income appears associated with higher age, more product purchases, fewer deals purchases, fewer kids at home, and higher education.
Chosen segmentation attributes: Time_since_enrol, Amount_per_purchase, Total_spendings, Total_offers, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth
Chosen profiling attributes: Year_Birth, Education, Marital_Status, Income, Customer_age, Family_Size
# Data selected from df_copy and that will be used for analysis: df-segm
df_segm = df_copy[['Income','Kidhome','Education', 'Marital_Status',
'Teenhome', 'NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth','MntWines', 'MntFruits',
'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts',
'MntGoldProds','Recency',
'Customer_age', 'Family_Size', 'Time_since_enrol', 'Total_spendings', 'Total_offers',
'Amount_per_purchase']]
# Attribute data:
df_attr = df_segm[['NumDealsPurchases', 'NumWebPurchases',
'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
'Time_since_enrol', 'Total_offers',
'Amount_per_purchase']]
# Profile data selected from df_copy and that will be used for clusters profile: df_profile
df_profile = df_segm[['Education', 'Marital_Status', 'Income', 'Customer_age','Family_Size']]
df_profile.reset_index(drop=True, inplace=True)
df_profile
| Education | Marital_Status | Income | Customer_age | Family_Size | |
|---|---|---|---|---|---|
| 0 | Graduation | Single | 58138.0 | 67 | 1 |
| 1 | Graduation | Single | 46344.0 | 70 | 3 |
| 2 | Graduation | Together | 71613.0 | 59 | 2 |
| 3 | Graduation | Together | 26646.0 | 40 | 3 |
| 4 | PhD | Married | 58293.0 | 43 | 3 |
| ... | ... | ... | ... | ... | ... |
| 2018 | Graduation | Together | 118043.5 | 47 | 3 |
| 2019 | Graduation | Married | 61223.0 | 57 | 3 |
| 2020 | Graduation | Divorced | 56981.0 | 43 | 1 |
| 2021 | Master | Together | 69245.0 | 68 | 3 |
| 2022 | PhD | Married | 52869.0 | 70 | 4 |
2023 rows × 5 columns
#get dummies
df_attr = pd.get_dummies(df_attr, drop_first =True)
# Scaling the data and storing the output as a new DataFrame
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(df_attr), columns = df_attr.columns)
data_scaled.head()
| NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | Time_since_enrol | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.500769 | 1.475284 | 2.700124 | -0.555750 | 0.726408 | 1.532331 | 0.613153 | 1.192302 |
| 1 | -0.151133 | -1.164865 | -0.597955 | -1.174167 | -0.130707 | -1.174695 | -0.509389 | -1.024993 |
| 2 | -0.803035 | 1.475284 | -0.231501 | 1.299501 | -0.559265 | -0.200166 | -0.509389 | 0.217259 |
| 3 | -0.151133 | -0.787701 | -0.964408 | -0.555750 | 0.297851 | -1.048367 | -0.509389 | -0.943835 |
| 4 | 1.804572 | 0.343792 | 0.134952 | 0.062667 | -0.130707 | -0.940086 | -0.509389 | -0.347018 |
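As a sanity check, z-scored columns should have mean ≈ 0 and (population) standard deviation ≈ 1, which is the formula `StandardScaler` applies. A quick manual check on synthetic data (the real `df_attr` columns differ):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for df_attr (hypothetical columns)
rng = np.random.default_rng(42)
df_attr = pd.DataFrame({"a": rng.normal(5, 2, 100), "b": rng.normal(-3, 7, 100)})

# z-score by hand: same formula StandardScaler uses (population std, ddof=0)
data_scaled = (df_attr - df_attr.mean()) / df_attr.std(ddof=0)

print(data_scaled.mean().abs().max() < 1e-9)          # True: centred
print((data_scaled.std(ddof=0) - 1).abs().max() < 1e-9)  # True: unit variance
```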
data_scaled_copy= data_scaled.copy(deep=True)
#Let's try to visualize the data for different perplexity values
for i in range(10, 50, 5):
tsne = TSNE(n_components = 2, random_state = 1, perplexity = i)
data_scaled_copy_tsne = tsne.fit_transform(data_scaled_copy)
data_scaled_tsne = pd.DataFrame(data_scaled_copy_tsne)
data_scaled_tsne.columns = ['X1', 'X2']
plt.figure(figsize = (5,5))
sns.scatterplot(x = 'X1', y = 'X2', data = data_scaled_tsne)
plt.title("perplexity = {}".format(i))
Observation and Insights: It is very difficult to clearly identify any cluster at any perplexity.
Think about it:
We will apply clustering algorithms to the PCA-transformed data, since no clear cluster can be identified from t-SNE and the original data has many features.
# Defining the number of principal components to generate
n = data_scaled_copy.shape[1]
# Finding principal components for the data
pca1 = PCA(n_components = n, random_state = 42)
data_scaled_copy_pca = pd.DataFrame(pca1.fit_transform(data_scaled_copy))
data_scaled_copy_pca
# The percentage of variance explained by each principal component
exp_var1 = pca1.explained_variance_ratio_
exp_var1
array([0.3895066 , 0.21586571, 0.12276702, 0.09498901, 0.061424 ,
0.04817276, 0.03760428, 0.02967062])
# Visualize the explained variance by individual components
plt.figure(figsize = (10, 8))
plt.plot(range(1, 9), pca1.explained_variance_ratio_.cumsum(), marker = 'o', linestyle = '--')
plt.title("Explained Variances by Components")
plt.xlabel("Number of Components")
plt.ylabel("Cumulative Explained Variance")
plt.show()
The first 2 components explain about 60% of the variance and the first 3 about 73%, so 70% is reached with 3 components. This is not fully satisfactory.
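Instead of reading the cumulative-variance plot by eye, scikit-learn can choose the component count for a target variance fraction: passing a float to `n_components` keeps the smallest number of components whose cumulative explained variance reaches that fraction. A sketch on synthetic 8-feature data standing in for `data_scaled_copy`:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic 8-feature data: three latent dimensions duplicated with noise,
# plus two pure-noise columns (hypothetical stand-in for the real frame)
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 3))
X = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3)), rng.normal(size=(200, 2))])

# A float n_components keeps just enough components to reach that fraction
pca = PCA(n_components=0.70, random_state=42).fit(X)
print(pca.n_components_, round(pca.explained_variance_ratio_.sum(), 3))
```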
data_scaled_copy_pca
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.142691 | 1.861015 | 1.150941 | -0.397170 | 0.187128 | -1.885127 | -1.108254 | -0.588348 |
| 1 | -1.858299 | -1.334782 | -0.292856 | 0.454520 | 0.412357 | -0.249644 | -0.248203 | 0.042585 |
| 2 | 1.158323 | -0.036337 | -1.050881 | -0.192170 | -1.505534 | 0.476404 | 0.146352 | 0.473095 |
| 3 | -1.747297 | -0.834835 | -0.412753 | 0.481666 | -0.105158 | 0.204336 | 0.051958 | -0.111660 |
| 4 | -0.255112 | 0.691678 | -1.447538 | 1.090458 | 0.843309 | -0.371551 | 0.010073 | 0.016424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018 | -1.614064 | 0.669353 | -0.315548 | 0.114864 | 0.760880 | -0.207679 | -0.162363 | 0.372103 |
| 2019 | 1.015851 | 0.573296 | -0.201157 | -0.052381 | -1.110435 | -1.558725 | 0.608450 | 0.627389 |
| 2020 | 1.471442 | -0.926131 | -0.086432 | 0.264990 | -0.342222 | 1.455138 | 1.081567 | -1.689674 |
| 2021 | 1.493209 | -0.572487 | -1.536689 | 0.253703 | -0.295094 | 0.164763 | -0.310308 | -0.146040 |
| 2022 | -1.004285 | 1.083098 | 1.159305 | -0.337881 | 0.453424 | 0.408686 | -0.104356 | 0.321399 |
2023 rows × 8 columns
#Display coefficients for each PCA
pca1.components_
array([[-0.07705304, 0.33832866, 0.49335718, 0.43501722, -0.35502282,
0.06265626, 0.28515931, 0.48546104],
[ 0.58548139, 0.4264846 , -0.01277399, 0.13064522, 0.45308201,
0.50055509, 0.00856716, -0.04667951],
[-0.31066639, -0.1811064 , -0.01099432, -0.37609205, 0.27173606,
0.37301149, 0.69924042, 0.16494099],
[ 0.32594832, 0.2583624 , -0.06733621, -0.084557 , 0.13137816,
-0.71833872, 0.52128916, -0.1015351 ],
[ 0.64271357, -0.60295541, 0.24764228, -0.10927976, -0.34213284,
0.10570897, 0.14671606, 0.01845023],
[-0.08147448, -0.25398162, -0.35369141, 0.73436696, -0.03990672,
0.10653671, 0.33430433, -0.37384596],
[ 0.16006579, -0.12411817, -0.62962583, 0.08956788, 0.02362959,
-0.07235858, -0.06735089, 0.73769388],
[ 0.05637887, 0.40078266, -0.41100788, -0.29298026, -0.67702954,
0.25022086, 0.14181002, -0.20084874]])
# Better representation of the loadings: variables vs. principal components
cols = ['PCA1', 'PCA2', 'PCA3', 'PCA4', 'PCA5', 'PCA6', 'PCA7', 'PCA8']
pc1 = pd.DataFrame(np.round(pca1.components_.T, 2), index = data_scaled_copy.columns, columns = cols)
pc1
| PCA1 | PCA2 | PCA3 | PCA4 | PCA5 | PCA6 | PCA7 | PCA8 | |
|---|---|---|---|---|---|---|---|---|
| NumDealsPurchases | -0.08 | 0.59 | -0.31 | 0.33 | 0.64 | -0.08 | 0.16 | 0.06 |
| NumWebPurchases | 0.34 | 0.43 | -0.18 | 0.26 | -0.60 | -0.25 | -0.12 | 0.40 |
| NumCatalogPurchases | 0.49 | -0.01 | -0.01 | -0.07 | 0.25 | -0.35 | -0.63 | -0.41 |
| NumStorePurchases | 0.44 | 0.13 | -0.38 | -0.08 | -0.11 | 0.73 | 0.09 | -0.29 |
| NumWebVisitsMonth | -0.36 | 0.45 | 0.27 | 0.13 | -0.34 | -0.04 | 0.02 | -0.68 |
| Time_since_enrol | 0.06 | 0.50 | 0.37 | -0.72 | 0.11 | 0.11 | -0.07 | 0.25 |
| Total_offers | 0.29 | 0.01 | 0.70 | 0.52 | 0.15 | 0.33 | -0.07 | 0.14 |
| Amount_per_purchase | 0.49 | -0.05 | 0.16 | -0.10 | 0.02 | -0.37 | 0.74 | -0.20 |
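The later claims that 2 PCs capture about 50% of the variance and 3 PCs about 70% come from the `explained_variance_ratio_` attribute of the fitted PCA. A minimal sketch of that check on synthetic stand-in data (in the notebook one would use the fitted `pca1` and `data_scaled_copy` directly):

```python
# Sketch: cumulative explained variance as a guide for choosing the
# number of components. Synthetic data stands in for the scaled
# feature matrix (`data_scaled_copy` in this notebook).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = StandardScaler().fit_transform(rng.normal(size=(200, 8)))

pca = PCA(n_components=8).fit(X)
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(np.round(cum_var, 2))  # cumulative share of variance after each component
```

Reading this array tells you how many components are needed for a target share of variance; with all 8 components the cumulative share reaches 1.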
plt.figure(figsize=(8, 6))
sns.scatterplot(x=data_scaled_copy_pca.iloc[:,0], y=data_scaled_copy_pca.iloc[:,1], data=data_scaled_copy_pca)
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
The scatter plot shows one dense cluster on the left and more widely spread points on the right.
# Empty dictionary to store the SSE for each value of K
sse = {}
# Iterate for a range of Ks and fit the scaled data to the algorithm.
# Use inertia attribute from the clustering object and store the inertia value for that K
k_means_df = data_scaled_copy_pca.copy()
for k in range(1, 10):
    kmeans = KMeans(n_clusters = k, random_state = 1).fit(k_means_df)
    sse[k] = kmeans.inertia_
# Elbow plot
plt.figure()
plt.plot(list(sse.keys()),list(sse.values()), 'bx-')
plt.xlabel("Number of clusters")
plt.ylabel("SSE")
plt.show()
From the elbow method, 2 or 3 clusters seem best. Let's check the silhouette scores.
# Range of cluster numbers to try
cluster_range = range(2, 11)
silhouette_scores = []
# Perform KMeans clustering for each cluster number and calculate silhouette score
for n_clusters in cluster_range:
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    cluster_labels = kmeans.fit_predict(k_means_df)
    silhouette_avg = silhouette_score(k_means_df, cluster_labels)
    silhouette_scores.append(silhouette_avg)
# Create a line plot of the silhouette scores
plt.figure(figsize=(8, 6))
plt.plot(cluster_range, silhouette_scores)
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for Different Cluster Numbers')
plt.show()
2 clusters have the highest silhouette score, so we will choose 2 clusters.
# Visualizing the 2 clusters on the PCA graph. As 2 PCs reflect only ~50% of the variance, I chose a 3D graph
k_means_df = data_scaled_copy_pca.copy(deep=True)
kmeans = KMeans(n_clusters = 2, random_state = 1)
kmeans.fit(k_means_df)
# Adding predicted labels to the original data and the scaled data
k_means_df['KMeans_Labels'] = kmeans.predict(data_scaled_copy_pca)
labels1 = kmeans.labels_
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.set_title('3D PCA Visualization') # choosing 3D since 3 PCs capture ~70% of the variance
ax.scatter(k_means_df.iloc[:,0], k_means_df.iloc[:,1], k_means_df.iloc[:,2], c=labels1, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()
# As I hesitated with 3 clusters, let's look at what 3 clusters would look like
# Visualizing the 3 clusters on the PCA graph, again in 3D since 2 PCs reflect only ~50% of the variance
k_means_df = data_scaled_copy_pca.copy()
kmeans3 = KMeans(n_clusters = 3, random_state = 1)
kmeans3.fit(k_means_df)
# Adding predicted labels to the original data and the scaled data
k_means_df['KMeans_Labels'] = kmeans3.predict(data_scaled_copy_pca)
labels1 = kmeans3.labels_
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.set_title('3D PCA Visualization')
ax.scatter(k_means_df.iloc[:,0], k_means_df.iloc[:,1], k_means_df.iloc[:,2], c=labels1, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()
The third cluster sits between the other two, so it does not seem to add new information. I will continue with 2 clusters.
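One way to quantify whether the third cluster adds information is the adjusted Rand index between the 2- and 3-cluster partitions: a value near 1 means the 3-cluster solution mostly nests inside the 2-cluster one. A hedged sketch on synthetic blobs (in the notebook the inputs would be `kmeans.labels_` and `kmeans3.labels_`):

```python
# Sketch: agreement between a 2-cluster and a 3-cluster K-Means solution,
# measured with the adjusted Rand index. Synthetic stand-in data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(n_samples=300, centers=2, random_state=1)
labels2 = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(X)
labels3 = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

ari = adjusted_rand_score(labels2, labels3)
print(round(ari, 3))  # closer to 1 => k=3 largely just splits the k=2 clusters
```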
# Creating a copy of the original data
df1 = df_segm.copy(deep=True)
# Adding K-Means cluster labels to the K-Means and original dataframes
#k_means_df["KM_segments"] = kmeans.labels_
df1["KM_segments"] = kmeans.labels_
km_cluster_profile = df1.groupby("KM_segments").mean(numeric_only = True)
km_cluster_profile['Total_count'] = (df1.groupby("KM_segments")["Total_offers"].count().values)
km_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | Total_count | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KM_segments | ||||||||||||||||||||||
| 0 | 37270.609626 | 0.732620 | 0.531194 | 2.386809 | 2.755793 | 0.747772 | 3.648841 | 6.436720 | 77.013369 | 6.233512 | 32.981283 | 9.206328 | 6.562500 | 20.777629 | 48.588235 | 53.393939 | 2.912656 | 10.847175 | 152.790664 | 0.194296 | 13.192308 | 1122 |
| 1 | 70125.679245 | 0.085461 | 0.482797 | 2.038846 | 5.748058 | 4.977802 | 8.472808 | 3.895671 | 586.955605 | 40.503885 | 298.227525 | 60.296892 | 41.596004 | 62.447836 | 49.192009 | 57.213097 | 2.203108 | 10.967836 | 1100.731410 | 0.776915 | 53.792575 | 901 |
df_profile
| Education | Marital_Status | Income | Customer_age | Family_Size | |
|---|---|---|---|---|---|
| 0 | Graduation | Single | 58138.0 | 67 | 1 |
| 1 | Graduation | Single | 46344.0 | 70 | 3 |
| 2 | Graduation | Together | 71613.0 | 59 | 2 |
| 3 | Graduation | Together | 26646.0 | 40 | 3 |
| 4 | PhD | Married | 58293.0 | 43 | 3 |
| ... | ... | ... | ... | ... | ... |
| 2018 | Graduation | Together | 118043.5 | 47 | 3 |
| 2019 | Graduation | Married | 61223.0 | 57 | 3 |
| 2020 | Graduation | Divorced | 56981.0 | 43 | 1 |
| 2021 | Master | Together | 69245.0 | 68 | 3 |
| 2022 | PhD | Married | 52869.0 | 70 | 4 |
2023 rows × 5 columns
df1_cat = df1[['Marital_Status', 'Education','Family_Size' ,"KM_segments"]]
df1_cat.head()
| Marital_Status | Education | Family_Size | KM_segments | |
|---|---|---|---|---|
| 0 | Single | Graduation | 1 | 1 |
| 1 | Single | Graduation | 3 | 0 |
| 2 | Together | Graduation | 2 | 1 |
| 3 | Together | Graduation | 3 | 0 |
| 4 | Married | PhD | 3 | 0 |
# Visualisation of K-Means clusters by categorical variables
fig, ax =plt.subplots(1,2,figsize=(10,5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Marital_Status', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
    plt.xticks(rotation=60, fontsize=8)
    plt.title('"KM_segments" = {}'.format(i))
plt.show()
The marital status profile is the same in both clusters.
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Family_Size', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent')
    plt.xticks(rotation=60, fontsize=8)
    plt.title('"KM_segments" = {}'.format(i))
plt.show()
Larger family size for cluster 0 (first cluster)
order = ['Basic', 'Graduation', 'Master', 'PhD']
fig, ax = plt.subplots(1, 2, figsize=(10, 5))
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Education', data = df1_cat[df1_cat["KM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
    plt.xticks(rotation=60, fontsize=8)
    plt.title('"KM_segments" = {}'.format(i))
plt.show()
Comparable education status between the 2 clusters.
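Since `chi2_contingency` is already imported at the top of the notebook, these visual comparisons of categorical profiles could be backed by a chi-square test of independence between segment and category. A sketch with made-up counts (in the notebook the contingency table would come from `pd.crosstab(df1['Education'], df1['KM_segments'])`):

```python
# Sketch: chi-square test of independence between cluster label and a
# categorical variable. The counts below are invented for illustration.
import pandas as pd
from scipy.stats import chi2_contingency

# toy counts of each education level in segment 0 vs segment 1
table = pd.DataFrame(
    {'segment_0': [20, 400, 150, 120], 'segment_1': [5, 300, 250, 350]},
    index=['Basic', 'Graduation', 'Master', 'PhD'],
)
chi2, p, dof, _ = chi2_contingency(table)
print(p < 0.05)  # True => the education mix differs between segments
```

A large p-value would support the visual impression that the profiles are comparable; a small one would contradict it.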
Age: The average age is 53.
Income: The average income is 38,162.
Family size: On average, there are 2.3 members per household
Recency: The average recency is 49 days.
Purchasing Behavior:
Wines: 93.82
Fruits: 5.85
Meat Products: 33.88
Fish Products: 9.40
Sweet Products: 5.85
Gold Products: 5.85
Purchases:
Deals: 2.85
Web: 3.85
Catalog: 1.85
Store: 4.85
Web Visits: 3.85
Age: The average age is 57.
Income: The average income is 70,126.
Family size: On average, there are 2.9 members per household
Purchasing Behavior:
Wines: 592.46
Fruits: 51.62
Meat Products: 352.96
Fish Products: 75.63
Sweet Products: 42.96
Gold Products: 44.96
Purchases:
Deals: 2.04
Web: 5.7
Catalog: 4.99
Store: 8.47
Amount per purchase: 49.62
Web Visits: 3.89
Time since enrollment: 10.99
Total spendings: 1100.73
Total offers: 0.77
The first cluster (label 1) contains 901 observations. It has higher total offers, amount per purchase, total spendings, and more web, catalog, and store purchases. These clients have higher incomes, are older, and have smaller families with fewer kids and teens at home.
The second cluster (label 0) contains 1122 observations. It shows more deals purchases and web visits per month, but fewer web, catalog, and store purchases, fewer offers, and a lower amount per purchase. Its profile is lower income and larger family size, with more kids and teens at home.
Time since enrollment is comparable between the clusters.
Summary: the two clusters are not differentiated by the channels clients use. The higher-income cluster spends more overall and takes on more offers, but is not interested in deals.
k_med_df = data_scaled_copy_pca.copy(deep=True)
kmed = KMedoids(n_clusters = 2, random_state = 1)
kmed.fit(k_med_df)
KMedoids(n_clusters=2, random_state=1)
# Creating a copy of the original data
df2 = df_segm.copy()
# Add K-Medoids cluster labels to K-Medoids data
k_med_df["KMed_segments"] = kmed.labels_
# Add K-Medoids cluster labels to the whole data
df2["KMed_segments"] = kmed.labels_
Kmedlabels = kmed.labels_
plt.figure(figsize=(8, 6))
sns.scatterplot(x=k_med_df.iloc[:,0], y=k_med_df.iloc[:,1], data=k_med_df, hue = "KMed_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')
ax.set_title('3D PCA Visualization')
ax.scatter(k_med_df.iloc[:,0], k_med_df.iloc[:,1], k_med_df.iloc[:,2], c=Kmedlabels, cmap='viridis')
plt.title('3 Component PCA')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel ('Principal Component 3')
plt.grid()
plt.show()
kmed_cluster_profile = df2.groupby("KMed_segments").mean(numeric_only = True)
kmed_cluster_profile["count_in_each_segment"] = (
df2.groupby("KMed_segments")["Total_offers"].count().values)
kmed_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| KMed_segments | ||||||||||||||||||||||
| 0 | 35765.411825 | 0.753313 | 0.481142 | 2.026504 | 2.286442 | 0.605505 | 3.281346 | 6.371050 | 50.101937 | 5.099388 | 24.873598 | 7.481142 | 5.261850 | 16.751274 | 48.604485 | 53.036697 | 2.880734 | 10.795515 | 109.576835 | 0.193680 | 11.779990 | 981 |
| 1 | 67096.921305 | 0.153551 | 0.536468 | 2.425144 | 5.785029 | 4.539347 | 8.166027 | 4.301344 | 543.287908 | 36.934261 | 269.968330 | 55.007678 | 38.079894 | 60.599808 | 49.095010 | 57.032630 | 2.329175 | 11.000144 | 1013.143234 | 0.698656 | 49.628321 | 1042 |
Education_order = ['Basic', 'Graduation', 'Master', 'PhD']
print('Education for each label')
plt.subplots(1,2,figsize=(10,5))
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Education', data = df2[df2["KMed_segments"]==i], palette ='Set1', hue = 'Education', order = Education_order, stat ='percent')
    plt.xticks(rotation=60, fontsize=8)
    plt.title('Education for label {}'.format(i))
plt.show()
Marital_order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']
print('Marital Status for each label')
plt.subplots(1,2,figsize=(10,5))
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Marital_Status', data = df2[df2["KMed_segments"]==i], palette ='Set1', hue = 'Marital_Status', order = Marital_order, stat ='percent')
    plt.xticks(rotation=60, fontsize=8)
    plt.title('Marital Status for label {}'.format(i))
plt.show()
Family_size = ['1','2','3','4','5']
print('Family Size for each label')
fig, ax =plt.subplots(1,2,figsize=(10,5))
for i in range(2):
    plt.subplot(1, 2, i+1)
    sns.countplot(x='Family_Size', data = df2[df2["KMed_segments"]==i], palette ='Set1', stat = 'percent')
    plt.xticks(rotation=60, fontsize=8)
    plt.title('Family Size for label {}'.format(i))
plt.show()
Education for each label
Marital Status for each label
Family Size for each label
Age: The average age is 53.
Income: The average income is 35 765.
Family size: On average, there are 2.88 members per household
Recency: The average recency is 48.6 days.
Purchasing Behavior:
Wines: 50.1
Fruits: 5.099
Meat Products: 24.87
Fish Products: 7.48
Sweet Products: 5.26
Gold Products: 16.75
Purchases:
Deals: 2.02
Web: 2.28
Catalog: 0.60
Store: 3.28
Web Visits: 6.37
Time since enrollment: 10.79
Total Spendings: 109.57
Total offers: 0.19
Amount per purchase: 11.77
Age: The average age is 57.
Income: The average income is 67 096.
Family size: On average, there are 2.3 members per household
Recency: The average recency is 49.09 days.
Purchasing Behavior:
Wines: 543.28
Fruits: 36.93
Meat Products: 269.96
Fish Products: 55
Sweet Products: 38.08
Gold Products: 60.6
Purchases:
Deals: 2.42
Web: 5.78
Catalog: 4.53
Store: 8.16
Web Visits: 4.3
Time since enrollment: 11
Total Spendings: 1013.14
Total offers: 0.69
Amount per purchase: 49.62
Summary for each cluster:
First cluster: buys fewer products with no preferred channel, spends less, makes more web visits per month, and takes on fewer offers and fewer deals. Their profile is younger people with lower income, more kids at home, and bigger families. No impact of education or marital status.
Second cluster: buys more products through all channels, spends more, makes fewer web visits per month, and takes on more offers and more deals. Their profile is older people with higher income, fewer kids at home, and smaller families. No impact of education or marital status.
Time since enrollment is similar between clusters, a bit higher for the second.
Observations and Insights: Comparison with the K-Means clusters: the higher-income cluster takes on fewer deals with K-Means but more deals with K-Medoids.
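To see exactly how the K-Means and K-Medoids partitions overlap, a cross-tabulation of the two label vectors works well. A sketch with toy labels standing in for `df1["KM_segments"]` and `df2["KMed_segments"]`:

```python
# Sketch: cross-tabulate two cluster label vectors; off-diagonal counts
# are points the two algorithms assigned differently. Toy labels only.
import numpy as np
import pandas as pd

km_labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
kmed_labels = np.array([0, 0, 1, 1, 1, 1, 0, 0])

overlap = pd.crosstab(pd.Series(km_labels, name='KMeans'),
                      pd.Series(kmed_labels, name='KMedoids'))
print(overlap)  # each cell counts points with that pair of labels
```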
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
#Create new dataframe
hc_df = data_scaled_copy_pca.copy(deep=True)
hc_df1 = hc_df.copy(deep=True)
# List of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
# List of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock", 'cosine']
for metric in distance_metrics:
    for method in linkage_methods:
        Z = linkage(hc_df1, method=method, metric=metric)
        cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(hc_df1, metric=metric))
        print(f'Cophenetic Correlation Coefficient for metric {metric} and method {method} is {cophenetic_corr}')
Cophenetic Correlation Coefficient for metric euclidean and method single is 0.5739142238459091
Cophenetic Correlation Coefficient for metric euclidean and method complete is 0.5363585727989066
Cophenetic Correlation Coefficient for metric euclidean and method average is 0.7447600511914452
Cophenetic Correlation Coefficient for metric euclidean and method weighted is 0.6255148063460467
Cophenetic Correlation Coefficient for metric chebyshev and method single is 0.442583656212713
Cophenetic Correlation Coefficient for metric chebyshev and method complete is 0.5767882662462652
Cophenetic Correlation Coefficient for metric chebyshev and method average is 0.6531170174345047
Cophenetic Correlation Coefficient for metric chebyshev and method weighted is 0.5586349008606936
Cophenetic Correlation Coefficient for metric mahalanobis and method single is 0.7572491961515623
Cophenetic Correlation Coefficient for metric mahalanobis and method complete is 0.6449531172870345
Cophenetic Correlation Coefficient for metric mahalanobis and method average is 0.7878164560966789
Cophenetic Correlation Coefficient for metric mahalanobis and method weighted is 0.5968961538443791
Cophenetic Correlation Coefficient for metric cityblock and method single is 0.6553089968704862
Cophenetic Correlation Coefficient for metric cityblock and method complete is 0.7213233813561795
Cophenetic Correlation Coefficient for metric cityblock and method average is 0.7573727400751508
Cophenetic Correlation Coefficient for metric cityblock and method weighted is 0.651606647519734
Cophenetic Correlation Coefficient for metric cosine and method single is 0.15814661248802556
Cophenetic Correlation Coefficient for metric cosine and method complete is 0.5643160463925817
Cophenetic Correlation Coefficient for metric cosine and method average is 0.7407623901130885
Cophenetic Correlation Coefficient for metric cosine and method weighted is 0.5862487308320113
The highest correlation is for the mahalanobis metric with the average method (cophenetic coefficient of 0.7878). However, euclidean, cityblock, and cosine also have high coefficients.
# Dendrograms to choose the number of clusters, using the selected metrics with the average method
selected_metrics = ["euclidean", "mahalanobis", "cityblock", 'cosine']
for metric in selected_metrics:
    plt.figure(figsize = (30, 7))
    Z = linkage(hc_df1, metric = metric, method = 'average')
    dendrogram(Z)
    plt.title('Dendrogram Linkage with average method and distance metric: {}'.format(metric))
    plt.ylabel('Distance')
    plt.show()
# As ward linkage can be used only with the euclidean metric:
Z = linkage(data_scaled, method='ward', metric='euclidean')
cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(data_scaled, metric='euclidean'))
print(cophenetic_corr)
0.5786595294496293
# Dendrogram for the ward method and euclidean distance
Z = linkage(hc_df1, metric = "euclidean", method = 'ward')
dendrogram(Z)
plt.title('Dendrogram')
plt.ylabel('Distance')
plt.show()
The easiest dendrograms to read are cosine/average and euclidean/ward; they suggest 2, 3, or 5 clusters.
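The cluster counts read off the dendrograms can also be obtained programmatically by cutting the linkage matrix with `scipy.cluster.hierarchy.fcluster`. A sketch on synthetic blobs standing in for `hc_df1`:

```python
# Sketch: turning a linkage matrix into flat cluster labels, either by
# asking for a target number of clusters or by cutting at a height.
# Synthetic data stands in for the PCA-reduced frame.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, cluster_std=0.5, random_state=1)
Z = linkage(X, method='average', metric='euclidean')

labels_k = fcluster(Z, t=3, criterion='maxclust')   # cut to 3 flat clusters
labels_d = fcluster(Z, t=5.0, criterion='distance') # or cut at a chosen height
print(len(np.unique(labels_k)))  # number of flat clusters from the maxclust cut
```

Note that `fcluster` labels start at 1, unlike scikit-learn's 0-based labels.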
# Let's calculate the silhouette score for 2, 3, or 5 clusters
# Initializing Agglomerative Clustering with mahalanobis distance and average linkage
n_clusters = [2, 3, 5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'mahalanobis', linkage = "average")
    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)
    score = silhouette_score(data_scaled_copy_pca, preds)  # Calculating the silhouette score
    print('Silhouette score for metric mahalanobis, linkage average and', i, 'clusters is:', score)
Silhouette score for metric mahalanobis linkage average and 2 clusters is: 0.4075437966858895
Silhouette score for metric mahalanobis linkage average and 3 clusters is: 0.2795961118812597
Silhouette score for metric mahalanobis linkage average and 5 clusters is: 0.18745764432083314
# Let's calculate the silhouette score for 2, 3, or 5 clusters
# Initializing Agglomerative Clustering with euclidean distance and ward linkage
n_clusters = [2, 3, 5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'euclidean', linkage = "ward")
    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)
    score = silhouette_score(data_scaled_copy_pca, preds)  # Calculating the silhouette score
    print('Silhouette score for metric euclidean, linkage ward and', i, 'clusters is:', score)
Silhouette score for metric euclidean, linkage ward and 2 clusters is: 0.2801489803225674
Silhouette score for metric euclidean, linkage ward and 3 clusters is: 0.2775679740517781
Silhouette score for metric euclidean, linkage ward and 5 clusters is: 0.18651916068595786
# Let's calculate the silhouette score for 2, 3, or 5 clusters
# Initializing Agglomerative Clustering with cosine distance and average linkage
n_clusters = [2, 3, 5]
for i in n_clusters:
    HCmodel = AgglomerativeClustering(n_clusters = i, metric = 'cosine', linkage = 'average')
    # Fitting on PCA data
    preds = HCmodel.fit_predict(data_scaled_copy_pca)
    score = silhouette_score(data_scaled_copy_pca, preds)  # Calculating the silhouette score
    print('Silhouette score for metric cosine, linkage average and', i, 'clusters is:', score)
Silhouette score for metric cosine, linkage average and 2 clusters is: 0.26120149278785415
Silhouette score for metric cosine, linkage average and 3 clusters is: 0.26765486643315906
Silhouette score for metric cosine, linkage average and 5 clusters is: 0.15657201266832385
Both the cophenetic coefficient and the silhouette score are optimal for the average method with the mahalanobis distance and 2 clusters.
HCmodel = AgglomerativeClustering(n_clusters = 2, metric = 'mahalanobis', linkage = "average")
# Fitting on PCA data
preds = HCmodel.fit_predict(data_scaled_copy_pca)
score = silhouette_score(data_scaled_copy_pca, preds) # Calculating the silhouette score
print('Silhouette score is: ',score )
Z = linkage(hc_df1, method="average", metric='mahalanobis')
cophenetic_corr, cophenetic_dist = cophenet(Z, pdist(hc_df1, metric='mahalanobis'))
print('Cophenetic Correlation Coefficient for metric mahalanobis and method average is', cophenetic_corr)
Silhouette score is: 0.4075437966858895 Cophenetic Correlation Coefficient for metric mahalanobis and method average is 0.7878164560966789
# Clustering with 2 clusters, mahalanobis metric and average method
hierarchical = AgglomerativeClustering(n_clusters = 2, metric = 'mahalanobis', linkage = 'average')
hierarchical.fit(hc_df1)
AgglomerativeClustering(linkage='average', metric='mahalanobis')
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
HCLabels
0    2021
1       2
Name: count, dtype: int64
This is not a good spread of the data! Let's try the other options:
euclidean and ward
cosine and average
hc_df1.drop('HCLabels', axis=1, inplace = True)
# Clustering with 2 clusters, euclidean metric and ward method
hierarchical = AgglomerativeClustering(n_clusters = 2, metric = 'euclidean', linkage = 'ward')
hierarchical.fit(hc_df1)
AgglomerativeClustering(metric='euclidean')
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
HCLabels
0    1066
1     957
Name: count, dtype: int64
hc_df1.drop('HCLabels', axis=1, inplace = True)
# Clustering with 3 clusters, cosine metric and average method
hierarchical = AgglomerativeClustering(n_clusters = 3, metric = 'cosine', linkage = 'average')
hierarchical.fit(hc_df1)
AgglomerativeClustering(linkage='average', metric='cosine', n_clusters=3)
hc_df1['HCLabels'] = hierarchical.labels_
hc_df1.HCLabels.value_counts()
HCLabels
1    836
2    620
0    567
Name: count, dtype: int64
This spread is much better, so I go with the cosine metric, average linkage, and 3 clusters; for cosine, the silhouette score was also slightly higher with 3 clusters.
plt.figure(figsize=(8, 6))
sns.scatterplot(x=hc_df1.iloc[:,0], y=hc_df1.iloc[:,1], data=hc_df1, hue = 'HCLabels')
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
# Creating a copy of the original data
df3 = df_segm.copy()
# Adding hierarchical cluster labels to the HC algorithm and original dataframes
hc_df1['HCLabels'] = hierarchical.labels_
df3['HCLabels'] = hierarchical.labels_
hc_cluster_profile = df3.groupby("HCLabels").mean(numeric_only = True)
hc_cluster_profile["count_in_each_segment"] = (
df3.groupby("HCLabels")["Total_offers"].count().values)
hc_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
#hc_cluster_profile.style.highlight_min(color = "lightblue", axis = 0)
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | count_in_each_segment | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HCLabels | ||||||||||||||||||||||
| 0 | 55185.493827 | 0.375661 | 0.809524 | 3.758377 | 6.393298 | 2.723104 | 7.425044 | 6.269841 | 418.243827 | 23.375661 | 138.026455 | 30.985009 | 22.880291 | 52.358025 | 48.049383 | 57.576720 | 2.841270 | 11.124744 | 689.648810 | 0.373898 | 32.446208 | 567 |
| 1 | 33574.723086 | 0.794258 | 0.447368 | 1.867225 | 1.909091 | 0.440191 | 3.049043 | 6.460526 | 31.247608 | 4.312201 | 18.431818 | 6.135167 | 4.477273 | 13.680024 | 48.947368 | 52.334928 | 2.887560 | 10.796555 | 78.284091 | 0.168660 | 10.028852 | 836 |
| 2 | 73616.479839 | 0.035484 | 0.319355 | 1.327419 | 4.919355 | 5.503226 | 8.014516 | 2.864516 | 567.723790 | 42.950000 | 341.996774 | 67.676613 | 45.362903 | 62.023387 | 49.474194 | 56.546774 | 1.980645 | 10.836935 | 1139.860887 | 0.911290 | 58.851226 | 620 |
df3_cat = df3[['Marital_Status', 'Education','Family_Size' ,'HCLabels']]
df3_cat.head()
| Marital_Status | Education | Family_Size | HCLabels | |
|---|---|---|---|---|
| 0 | Single | Graduation | 1 | 2 |
| 1 | Single | Graduation | 3 | 1 |
| 2 | Together | Graduation | 2 | 0 |
| 3 | Together | Graduation | 3 | 1 |
| 4 | Married | PhD | 3 | 0 |
# Visualisation of HC clusters by categorical variables
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']
for i in range(3):
    plt.subplot(1, 3, i+1)
    sns.countplot(x='Marital_Status', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent', order = order)
    plt.xticks(rotation=60, fontsize=8)
    plt.title('HCLabels = {}'.format(i))
plt.show()
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i+1)
    sns.countplot(x='Family_Size', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent')
    plt.xticks(rotation=60, fontsize=8)
    plt.title('HCLabels = {}'.format(i))
plt.show()
order = ['Basic', 'Graduation', 'Master', 'PhD']
fig, ax = plt.subplots(1, 3, figsize=(15, 5))
for i in range(3):
    plt.subplot(1, 3, i+1)
    sns.countplot(x='Education', data = df3_cat[df3_cat['HCLabels']==i], palette ='Set1', stat = 'percent', order = order)
    plt.xticks(rotation=60, fontsize=8)
    plt.title('HCLabels = {}'.format(i))
plt.show()
Summary of each cluster:
First cluster:
Higher for: deals purchases, web purchases, time since enrollment
Lower for: recency
Intermediate for: product purchases and total offers; no preferred channel.
Its profile is intermediate between the other two clusters.
Second cluster:
Higher for: web visits per month
Lower for: number of purchases, total amount spent, all purchase channels, time since enrollment
Their profile is younger people, lower income, larger family size.
Third cluster:
Higher for: all types of product purchases, catalog and store purchases, recency, total spending.
Lower for: web visits per month, deals purchases
Their profile is older people, higher income, fewer kids at home, smaller families. No impact of education or marital status.
Time since enrollment differs little between clusters: a bit higher for the first, lower for the second.
DBSCAN is a powerful algorithm for finding high-density clusters, but determining the best set of hyperparameters to use with it is a challenge. It has two hyperparameters: eps and min_samples.
Since it is an unsupervised algorithm, there is no validation set to tune against, unlike in supervised learning. The approach we can follow is to try a range of value combinations and compute the silhouette score for each.
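Besides a grid search over silhouette scores, a common heuristic for eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbour and read eps off the knee of the curve. A sketch on synthetic data standing in for the PCA-reduced frame:

```python
# Sketch: the k-distance heuristic for choosing DBSCAN's eps.
# Synthetic data stands in for `data_scaled_copy_pca`.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=300, centers=2, random_state=1)

min_samples = 4
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)      # per-point distances, self included
k_dist = np.sort(distances[:, -1])   # distance to the k-th neighbour, sorted
# In the notebook one would plt.plot(k_dist) and read eps off the knee.
print(k_dist.shape)
```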
dbscan_df = data_scaled_copy_pca.copy(deep=True)
dbscan_df1 = dbscan_df.copy(deep=True)
# Define the parameter grid
param_grid = {
    'eps': np.arange(0.1, 4, 0.1),
    'min_samples': np.arange(2, 11, 1)
}
# Perform grid search
best_score = -1
best_params = None
for eps in param_grid['eps']:
    for min_samples in param_grid['min_samples']:
        dbscan = DBSCAN(eps=eps, min_samples=min_samples)
        clusters = dbscan.fit_predict(dbscan_df)
        if len(set(clusters)) > 1:  # Silhouette score is only valid if there is more than one cluster
            score = silhouette_score(dbscan_df, clusters)
            if score > best_score:
                best_score = score
                best_params = {'eps': eps, 'min_samples': min_samples}
print('Best Silhouette Score: ', best_score)
print('Best Parameters: ', best_params)
Best Silhouette Score: 0.42973238124276447
Best Parameters: {'eps': 3.5000000000000004, 'min_samples': 4}
df5 = df_segm.copy()
dbscan = DBSCAN(eps=3.5, min_samples=4)
dbscan_df["db_segments"] = dbscan.fit_predict(dbscan_df)
# Add DBSCAN cluster labels to whole data
df5["db_segments"] = dbscan.fit_predict(dbscan_df1)
db_cluster_profile = df5.groupby("db_segments").mean(numeric_only = True)
db_cluster_profile['Total_elements_per_cluster'] = df5.groupby("db_segments")['Total_offers'].count().values
db_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | Total_elements_per_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| db_segments | ||||||||||||||||||||||
| -1 | 6835.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 13.000000 | 107.000000 | 2.000000 | 12.000000 | 2.000000 | 2.000000 | 12.000000 | 76.000000 | 67.000000 | 3.000000 | 11.490000 | 137.000000 | 0.000000 | 137.000000 | 1 |
| 0 | 51925.828882 | 0.444609 | 0.509397 | 2.232938 | 4.090504 | 2.633037 | 5.799703 | 5.301187 | 304.227992 | 21.506429 | 151.184965 | 31.975767 | 22.175631 | 39.350148 | 48.843719 | 55.089021 | 2.596439 | 10.900623 | 575.199369 | 0.454006 | 31.222493 | 2022 |
plt.figure(figsize=(8, 6))
sns.scatterplot(x=dbscan_df1.iloc[:,0], y=dbscan_df1.iloc[:,1], data=dbscan_df, hue = "db_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
DBSCAN yields 2 groups, but the second (label -1) is a single noise point, so effectively there is only one cluster. This technique is not a good fit for this data.
gmm_df = data_scaled_copy_pca.copy(deep=True)
gmm_df1 = gmm_df.copy(deep=True)
df6 = df_segm.copy(deep=True)
gmm_df
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.142691 | 1.861015 | 1.150941 | -0.397170 | 0.187128 | -1.885127 | -1.108254 | -0.588348 |
| 1 | -1.858299 | -1.334782 | -0.292856 | 0.454520 | 0.412357 | -0.249644 | -0.248203 | 0.042585 |
| 2 | 1.158323 | -0.036337 | -1.050881 | -0.192170 | -1.505534 | 0.476404 | 0.146352 | 0.473095 |
| 3 | -1.747297 | -0.834835 | -0.412753 | 0.481666 | -0.105158 | 0.204336 | 0.051958 | -0.111660 |
| 4 | -0.255112 | 0.691678 | -1.447538 | 1.090458 | 0.843309 | -0.371551 | 0.010073 | 0.016424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018 | -1.614064 | 0.669353 | -0.315548 | 0.114864 | 0.760880 | -0.207679 | -0.162363 | 0.372103 |
| 2019 | 1.015851 | 0.573296 | -0.201157 | -0.052381 | -1.110435 | -1.558725 | 0.608450 | 0.627389 |
| 2020 | 1.471442 | -0.926131 | -0.086432 | 0.264990 | -0.342222 | 1.455138 | 1.081567 | -1.689674 |
| 2021 | 1.493209 | -0.572487 | -1.536689 | 0.253703 | -0.295094 | 0.164763 | -0.310308 | -0.146040 |
| 2022 | -1.004285 | 1.083098 | 1.159305 | -0.337881 | 0.453424 | 0.408686 | -0.104356 | 0.321399 |
2023 rows × 8 columns
#Defining best number of clusters
n_clusters = range(2, 11)
sil_scores = []
for n in n_clusters:
gmm = GaussianMixture(n, covariance_type='full', random_state=42).fit(gmm_df1)
labels = gmm.predict(gmm_df1)
sil_scores.append(silhouette_score(gmm_df1, labels))
plt.plot(n_clusters, sil_scores)
plt.xlabel('Number of clusters')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Score for GMM')
plt.show()
The silhouette score is highest for 2 clusters, followed by 3 and 9.
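For GMMs specifically, the BIC is a common complement to the silhouette score when choosing the number of components, since it accounts for model likelihood and complexity. A minimal sketch, using synthetic data in place of `gmm_df1`:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data with two well-separated groups, standing in for gmm_df1
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-2, 0.5, (150, 3)), rng.normal(2, 0.5, (150, 3))])

# Fit GMMs over a range of component counts and record the BIC of each
bics = []
for n in range(2, 7):
    gm = GaussianMixture(n_components=n, covariance_type='full',
                         random_state=42).fit(X)
    bics.append(gm.bic(X))

# Lower BIC is better
best_n = range(2, 7)[int(np.argmin(bics))]
print(f"BIC prefers {best_n} components")
```

Agreement between the silhouette curve and the BIC curve strengthens the case for a given cluster count.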
# Let's apply Gaussian Mixture
gmm = GaussianMixture(n_components = 2, random_state = 1) # Initializing the Gaussian Mixture algorithm with n_components = 2
gmm.fit(gmm_df)
gmm_df["GMM_segments"] = gmm.predict(gmm_df)
df6["GMM_segments"] = gmm.predict(gmm_df1)
gmm_df
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | GMM_segments | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.142691 | 1.861015 | 1.150941 | -0.397170 | 0.187128 | -1.885127 | -1.108254 | -0.588348 | 1 |
| 1 | -1.858299 | -1.334782 | -0.292856 | 0.454520 | 0.412357 | -0.249644 | -0.248203 | 0.042585 | 0 |
| 2 | 1.158323 | -0.036337 | -1.050881 | -0.192170 | -1.505534 | 0.476404 | 0.146352 | 0.473095 | 1 |
| 3 | -1.747297 | -0.834835 | -0.412753 | 0.481666 | -0.105158 | 0.204336 | 0.051958 | -0.111660 | 0 |
| 4 | -0.255112 | 0.691678 | -1.447538 | 1.090458 | 0.843309 | -0.371551 | 0.010073 | 0.016424 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2018 | -1.614064 | 0.669353 | -0.315548 | 0.114864 | 0.760880 | -0.207679 | -0.162363 | 0.372103 | 0 |
| 2019 | 1.015851 | 0.573296 | -0.201157 | -0.052381 | -1.110435 | -1.558725 | 0.608450 | 0.627389 | 1 |
| 2020 | 1.471442 | -0.926131 | -0.086432 | 0.264990 | -0.342222 | 1.455138 | 1.081567 | -1.689674 | 1 |
| 2021 | 1.493209 | -0.572487 | -1.536689 | 0.253703 | -0.295094 | 0.164763 | -0.310308 | -0.146040 | 1 |
| 2022 | -1.004285 | 1.083098 | 1.159305 | -0.337881 | 0.453424 | 0.408686 | -0.104356 | 0.321399 | 0 |
2023 rows × 9 columns
plt.figure(figsize=(8, 6))
sns.scatterplot(x=gmm_df.iloc[:,0], y=gmm_df.iloc[:,1], data=gmm_df, hue = "GMM_segments")
plt.title('2 Component PCA')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.grid()
plt.show()
Observations and Insights:
GMM finds 2 clusters. Some orange points are scattered on the left of the graph, inside the blue cluster.
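Those orange points inside the blue cluster can be examined directly, because a GMM produces soft assignments: `predict_proba` reveals which observations sit near the boundary between the two components. A sketch on synthetic overlapping data (standing in for the PCA components):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two overlapping Gaussian blobs, standing in for the PCA-transformed data
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1.0, (200, 2)), rng.normal(1, 1.0, (200, 2))])

gm = GaussianMixture(n_components=2, random_state=1).fit(X)
proba = gm.predict_proba(X)  # shape (n_samples, 2), rows sum to 1

# Points whose top probability is below 0.9 lie near the cluster boundary
uncertain = int((proba.max(axis=1) < 0.9).sum())
print(f"{uncertain} of {len(X)} points are ambiguously assigned")
```

Filtering the scatterplot by this uncertainty mask would show whether the stray orange points are genuinely ambiguous or confidently (mis)assigned.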
gmm_cluster_profile = df6.groupby("GMM_segments").mean(numeric_only = True)
gmm_cluster_profile
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | ... | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMM_segments | |||||||||||||||||||||
| 0 | 36360.455217 | 0.768701 | 0.511811 | 2.285433 | 2.488189 | 0.616142 | 3.474409 | 6.400591 | 58.196850 | 5.604823 | ... | 8.060531 | 5.607160 | 17.617618 | 49.360236 | 53.085630 | 2.933071 | 10.838750 | 123.634719 | 0.175197 | 11.615512 |
| 1 | 67585.539722 | 0.117180 | 0.507448 | 2.177756 | 5.703078 | 4.665343 | 8.141013 | 4.199603 | 552.262165 | 37.530785 | ... | 56.074975 | 38.872145 | 61.249752 | 48.349553 | 57.122145 | 2.257200 | 10.963635 | 1030.364697 | 0.734856 | 51.109752 |
2 rows × 21 columns
gmm_cluster_profile.style.highlight_max(color = "lightgreen", axis = 0)
| Income | Kidhome | Teenhome | NumDealsPurchases | NumWebPurchases | NumCatalogPurchases | NumStorePurchases | NumWebVisitsMonth | MntWines | MntFruits | MntMeatProducts | MntFishProducts | MntSweetProducts | MntGoldProds | Recency | Customer_age | Family_Size | Time_since_enrol | Total_spendings | Total_offers | Amount_per_purchase | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GMM_segments | |||||||||||||||||||||
| 0 | 36360.455217 | 0.768701 | 0.511811 | 2.285433 | 2.488189 | 0.616142 | 3.474409 | 6.400591 | 58.196850 | 5.604823 | 28.540354 | 8.060531 | 5.607160 | 17.617618 | 49.360236 | 53.085630 | 2.933071 | 10.838750 | 123.634719 | 0.175197 | 11.615512 |
| 1 | 67585.539722 | 0.117180 | 0.507448 | 2.177756 | 5.703078 | 4.665343 | 8.141013 | 4.199603 | 552.262165 | 37.530785 | 274.787488 | 56.074975 | 38.872145 | 61.249752 | 48.349553 | 57.122145 | 2.257200 | 10.963635 | 1030.364697 | 0.734856 | 51.109752 |
df6_cat = df6[['Marital_Status', 'Education','Family_Size' ,"GMM_segments"]]
df6_cat.head()
| Marital_Status | Education | Family_Size | GMM_segments | |
|---|---|---|---|---|
| 0 | Single | Graduation | 1 | 1 |
| 1 | Single | Graduation | 3 | 0 |
| 2 | Together | Graduation | 2 | 1 |
| 3 | Together | Graduation | 3 | 0 |
| 4 | Married | PhD | 3 | 0 |
# Visualisation of GMM clusters by categorical variables
fig, ax =plt.subplots(1,2,figsize=(10,5))
order = ['Single', 'Widow', 'Divorced', 'Married', 'Together']
for i in range (2):
plt.subplot(1,2,i+1)
sns.countplot(x='Marital_Status', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
plt.xticks(rotation=60, fontsize=8)
plt.title ('GMM_segments= {}'.format(i))
plt.show()
fig, ax =plt.subplots(1,2,figsize=(10,5))
for i in range (2):
plt.subplot(1,2,i+1)
sns.countplot(x='Family_Size', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent')
plt.xticks(rotation=60, fontsize=8)
plt.title ('GMM_segments = {}'.format(i))
plt.show()
order = ['Basic', 'Graduation', 'Master', 'PhD']
#for i in range (2):
fig, ax =plt.subplots(1,2,figsize=(10,5))
for i in range (2):
plt.subplot(1,2,i+1)
sns.countplot(x='Education', data = df6_cat[df6_cat["GMM_segments"]==i], palette ='Set1', stat = 'percent', order = order)
plt.xticks(rotation=60, fontsize=8)
plt.title ('GMM_segments = {}'.format(i))
plt.show()
Observations and Insights:
Summary of each cluster:
First cluster: buys fewer products, has no preferred channel, spends less, visits the website more often per month, takes more deals but accepts fewer offers. Recency is higher. Profile: younger customers with lower income, more kids at home, and larger families. Education and marital status show no visible impact.
Second cluster: buys more products through all channels, spends more, visits the website less often, accepts more offers and takes fewer deals. Profile: older customers with higher income, fewer kids at home, and smaller families. Education and marital status show no visible impact.
Time since enrollment barely differs between clusters; it is slightly higher for the second cluster.
This profile is comparable to the K-Means cluster profiles.
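The "no visible impact of education or marital status" reading from the countplots can be checked formally with a chi-square test of independence (`chi2_contingency` is already imported at the top of the notebook). A sketch using a toy frame in place of `df6_cat`:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Toy frame standing in for df6_cat: random education levels vs. segments
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Education": rng.choice(["Basic", "Graduation", "Master", "PhD"], 400),
    "GMM_segments": rng.integers(0, 2, 400),
})

# Contingency table of category counts per segment
table = pd.crosstab(toy["Education"], toy["GMM_segments"])
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")  # a large p-value means no evidence of association
```

Run on the real `df6_cat`, a large p-value for `Education` and `Marital_Status` would back the visual conclusion with a statistic.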
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
kmeans = KMeans(n_clusters = 2, random_state = 1, n_init = 'auto') # Initializing K-Means with number of clusters as 2 and random_state=1
preds = kmeans.fit_predict((data_scaled_copy_pca)) # Fitting and predicting K-Means on data_scaled_copy_pca
score = silhouette_score(data_scaled_copy_pca, preds) # Calculating the silhouette score
print(score)
0.3073648203227294
kmedoids = KMedoids(n_clusters = 2, random_state = 1) # Initializing K-Medoids with number of clusters as 2 and random_state=1
preds = kmedoids.fit_predict((data_scaled_copy_pca)) # Fitting and predicting K-Medoids on data_scaled_copy_pca
score = silhouette_score(data_scaled_copy_pca, preds) # Calculating the silhouette score
print(score)
0.289221596360021
# Initializing Agglomerative Clustering with cosine distance, average linkage, and 3 clusters
HCmodel = AgglomerativeClustering(n_clusters = 3, metric = "cosine", linkage = "average")
# Fitting on PCA data
preds = HCmodel.fit_predict(data_scaled_copy_pca)
score = silhouette_score(data_scaled_copy_pca, preds) # Calculating the silhouette score
print(score)
0.2657754498010657
# Initializing Gaussian Mixture algorithm with number of clusters as 2 and random_state = 1
gmm = GaussianMixture(n_components=2, random_state=1)
# Fitting and predicting Gaussian Mixture algorithm on data_scaled_copy_pca
preds = gmm.fit_predict((data_scaled_copy_pca))
# Calculating the silhouette score
score = silhouette_score(data_scaled_copy_pca, preds)
# Printing the score
print(score)
0.288121070248977
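The four scores above come from four separate cells; collecting them in one loop makes the comparison easier to reproduce and extend. A sketch with synthetic data standing in for `data_scaled_copy_pca` (K-Medoids is omitted so the block runs on plain scikit-learn):

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Synthetic data with two separated groups, standing in for data_scaled_copy_pca
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.6, (150, 4)), rng.normal(2, 0.6, (150, 4))])

# One model per technique, each configured as in the cells above
models = {
    "KMeans": KMeans(n_clusters=2, random_state=1, n_init=10),
    "Agglomerative": AgglomerativeClustering(n_clusters=2),
    "GMM": GaussianMixture(n_components=2, random_state=1),
}

# Fit each model and score its labels with the silhouette coefficient
scores = {name: silhouette_score(X, m.fit_predict(X))
          for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.3f}")
```

On the real PCA data, the same loop would reproduce the scores printed above and rank the techniques in one pass.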
2. Refined insights:
3. Proposal for the final solution design:
Based on the silhouette scores, the K-Means algorithm with 2 clusters gives the best score on this data, so we proceed with K-Means (2 clusters) as the final algorithm.
We have identified 2 clusters:
First cluster: people who spend a lot of money on all kinds of products, especially expensive ones such as wine or meat. They reach the store through all channels. They visit the website less often, and every time they buy a product they spend a higher amount per purchase than the other cluster. These people are older, with higher income and smaller family size. Their level of education or marital status does not seem to impact their purchase behavior. They enrolled longer ago.
Second cluster: people who buy fewer products and spend less money each time they buy. They visit the website more often, perhaps to look for deals, as they respond more to deals. They do not seem to have a preferred channel. These people are younger, with bigger families and lower income.